Pentagon 911 Lessons Learned – IT Disaster Recovery Plan
Bill Alderson discusses the technical and IT Best Practice lessons learned from the events subsequent to the Pentagon 911 disaster. Several key IT Disaster Recovery Plan lessons can be learned from this video.
Transcript for this Video
Good afternoon. This is Bill Alderson coming to you from beautiful Austin, Texas. I hope everyone is doing well today. I’m going to have a short presentation for you. It’ll be under 30 minutes out of respect for your time.
We’re going to talk a little bit about the Lessons Learned 10 Years Later, from the Pentagon 9/11 disaster. I was privileged to be one of those that got called. We all remember the event and probably where we were, what we were doing, who we were with, how we responded. Our world was kind of devastated for a while. A couple of days later I got a call from a Pentagon General asking if we wouldn’t come in and bear a hand in helping them recover communications. It was obviously a very proud time in my life to be prepared and to be asked to serve in that capacity.
We responded and came in and helped out as best we could. We analyzed a lot of various problems. We took a look at various communication circuits, network management systems, security infrastructure, pretty much the whole gamut of IT. We analyzed, assessed, looked at, and prepared a 100-page report for them so that they’d know the results of our analysis and then how to proceed with that information.
My organization serves large-scale environments, the Global 2000, military, government. In the last 10 years, we’ve gone into Iraq a half a dozen times and Afghanistan and helped diagnose problems across the entire war environment, and then back to the Pentagon to analyze other things. That’s the nature of our service. It’s high visibility, large-scale critical problem type environments.
In the last 20 years of diagnosing critical problems down to the technical root cause, we’ve learned a few other things. We’ve learned that it’s not just the technical problem that we want to avoid, but the processes which lead to the ability of those technical problems to exist. Essentially, our root cause is changed from technical root cause back into the best practices that obviate those problems. We look back and find what processes were omitted or not in effect.
Typically I’ve got a CIO whose enterprise was just melting down and we came in, we diagnosed the problems, mitigated the problem, and of course they have to ask, “Well what can we do to prevent?” Not only in the last 20 years have we been diagnosing critical problems, but we’ve been analyzing what kinds of procedures, processes, and systems allowed those critical problems to exist. That’s kind of the way that we’re moving forward with the new Apalytics Corporation, to help people mitigate and identify those things that they can do that will avoid crisis and avoid problems.
I hope you came to learn a little bit and to just see my perspective and that sort of thing. Now, what we’ve done is we’ve kind of taken all of these lessons learned and we call this Technical Systemization. There are pillars to technical systemization. We’re going to go through some of those pillars today because they directly relate to the events of 9/11 and pretty much any other type of disaster. Taking the lessons learned and applying them.
Now, Technical Systemization is very much an implementation. ITIL is a model. ISO has models. We have protocol models. We have the OSI model. Well, the OSI model is just that, it is a model. It is not an implementation. I want to draw a distinction between a model, like ITIL or the OSI model, and an actual implementation, such as IPX, TCP/IP, or AppleTalk. Those are true implementations. They look different from the model because they are applied to your exact problem and your exact environment in such a way as to have the pieces to make things truly work.
Changing the names of your organizations to match a model, for instance the ITIL model, is not necessarily implementing the Technical Systemization that will allow that to work. That’s part of our message. Each institution needs to have their own Technical Systemization, their implementation of ITIL. ITIL is not a protocol like TCP/IP. TCP/IP is an implementation of a model, but not the model itself. The model is not the implementation.
One of the things that we’ve learned is that there are a bunch of silos out there. The Pentagon has silos. A lot of them have demarks for who runs the network, desktop, server, platform, network management, and security. All of these are one of many silos. What we call Business Technology Integration is the ability to do cross-silo collaboration, whether it is system design, metrics and monitoring, or network documentation.
Let’s just take network documentation for a moment. As I travel around and look at various places I walk and invariably I see diagrams on the wall. Sadly, most of them are labeled 2008, 2007, 2010 and they are typically dated and that sort of thing. I don’t know why they’d still be on the wall. Nevertheless, it’s probably because they haven’t been updated. It’s an interesting thing. If you take all your business silos, desktop silo, network silo, application and platform silos and with each one of them build a network diagram or a system diagram, if you were to take those and lay them horizontally from the client to the server and the application, could you put those things together and say, “Okay, here’s the desktop, here’s the network, here’s the WAN, here’s the security infrastructure, here’s the platform, and here’s the application and all the dependencies”? Could you put your finger on the diagram on the left and give a troubleshooter an end-to-end diagram? I don’t think so.
Case in point, when we arrived at the Pentagon, we came into a room and we started, “Okay, first thing, we need to know a little bit of the lay of the land.” There was kind of a gulp and they said, “We have a great network documentation system, however, it was one of the servers impacted in the disaster.” They didn’t have anything printed out because it was online for ready reference. Don’t forget, sometimes those systems go down. It’s probably a good idea to have the recent version printed out somewhere so that you can refer to it.
We started with a blank white board and started diagramming. Who had to do that diagramming? Well, the key technologists that had a lot of other fires burning. We came in to help. Because these key technologists had to help us understand the environment so that we could bear a hand and start helping them, so you’d have more people helping, they were pulled away from their key duties. It’s incumbent on you to make certain that your key technologists are documenting the systems and that you are training everyone so that everybody knows the key elements of the system.
We’re going to talk a little bit about that, but that’s one of the examples of how organizational silos need cross technology collaboration teams to pull together system documentation that’s congruent. It gives you an end-to-end picture in network monitoring metrics and that sort of thing.
Back to The Art of War, Sun Tzu and The Art of War is one of the books that are studied by most military officers at one time or another. The bottom line to this is ‘Know Yourself’ and ‘Know Your Enemy’. That’s a fundamental thing. If you know yourself, you can react to things. In the area of knowing yourself, we need to develop end-to-end cross technology, cross-silo documentation.
We have literally hundreds of people who are willing participants in optimizing our network systems and applications for the end users, but I find secretly that some of the people who come to the table a little bit later, the people who’ve come in the last few months, they’re not going to come and tell you, “Hey, I’m really impotent because I don’t know where anything is. I don’t have a diagram. You don’t have a diagram.” They’re not going to come tell you this. It is one of the biggest lost production areas that your organization has. You have hundreds of people who are willing participants, but yet are impotent to participate because there is very little true end-to-end cross technology, cross-silo documentation.
When there’s a trouble ticket they can, first of all, identify the client, put their finger on the client network. Then they can start moving to the right and into the network and see the network infrastructure that they’re dependent upon. Then they can move their finger to the right and see the firewalls, and load balancers, and WAN optimizers that they’re dependent upon for that system to work. Then they can move over to the right and look at the data center, the data center distribution, into the virtual platform, into the virtual image that they’re running, then into the virtual switch, then on into that process that they’re connecting to, then back into the application, and then the application having secondary responsibilities or geared capabilities. Now you have the ability for a help desk person to kind of understand your infrastructure and very rapidly assimilate and at least triage where the problem may or may not be.
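The left-to-right walk across an end-to-end diagram can be sketched in a few lines of Python. This is only an illustration, not the tooling used at the Pentagon: the hop names, the silo labels, and the `first_failing_hop` helper are all hypothetical, and the health check is whatever test a help desk would actually run at each point.

```python
# Hypothetical end-to-end path for one application, ordered client -> application.
# Each hop records which silo owns it; every name here is illustrative only.
PATH = [
    ("client workstation", "desktop"),
    ("access network", "network"),
    ("firewall", "security"),
    ("load balancer", "network"),
    ("WAN optimizer", "network"),
    ("data center distribution", "network"),
    ("virtual switch", "platform"),
    ("virtual machine", "platform"),
    ("application process", "application"),
]


def first_failing_hop(path, is_healthy):
    """Walk the path left to right and return the first hop that fails its check.

    `is_healthy` is a caller-supplied function that takes a hop name and
    returns True or False. The result is (hop_name, owning_silo), or None
    if every hop passes.
    """
    for hop, silo in path:
        if not is_healthy(hop):
            return hop, silo
    return None


# Example: pretend the load balancer is the only unhealthy component.
down = {"load balancer"}
suspect = first_failing_hop(PATH, lambda hop: hop not in down)
```

With a path like this written down, a trouble ticket can be routed to the owning silo of the first failing hop instead of bouncing between teams.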
Then you can see what- Network Management. You can say, “Test Point One. Test Point Two. Test Point Three. Test Point Four.” You can visualize your infrastructure. Probably the number one thing that I would stress is that you must know yourself. If you don’t have that, it’s costing you orders of magnitude more than you think and a lot more than it would cost you to put these systems in place.
Now, once you have that cross technology, cross-silo documentation and everyone can meaningfully participate, then you make sure you train everyone on the design constraints, your architecture, and certify them on your system architecture, your abilities, your capabilities, your constraints of your architecture so that they’re not trying to go in and modify to make it look like wherever they came from previously. You want your architecture to be pure and to work systematically with all your various components. It’s important to educate everyone on that part.
Now, one of the things, and an example of something that we’ve done inside large scale environments, is to help people to develop scalable cross location, cross domain data collection capabilities. Today, you have so many firewalls, load balancers, locations, different domains, different security levels, etc, that you can’t just go out and discover your network and then map it. It’s very difficult. If someone can, usually they can do it in one area, but not in another area.
Our philosophy is you pull all that information from a variety of sources into a single repository across the organization. That becomes what we call, Rapid Network Rendering Database. That database becomes a repository and you create a SOA, over to the right there you’ll see a SOA, so that other corporate systems can come in so that you can export to your network management systems to make certain that you’re monitoring all of the various dependencies of applications and systems.
Over on the left, when data goes into that RNR database, one of the things that we do is we make sure that it has the correct endpoints on it. When those systems have the correct endpoint, when those objects, your network management objects, your router port objects, your switch port objects, have ends placed on them, you can do reconnection of those objects. When they are simply an object and don’t have two endpoints, it’s really difficult. Just populating a database without those endpoints connected on both sides isn’t going to do you any good. You’re not going to be able to export them.
We’ve designed a way so that, when that data comes in, we put the endpoints on the objects and then we connect the points and populate a Visio diagram so that you can see the details of your network infrastructure port by port, logically, in a visual way, with the VLANs and all the virtual machine information, so that a troubleshooter can truly see the environment that they are trying to troubleshoot. That’s just one particular way that we like to see things done.
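To show why two endpoints per object matter, here is a minimal Python sketch. It is not the Rapid Network Rendering Database itself: the record layout, the device and port names, and the `build_topology` function are assumptions made up for this illustration. The point is only that a link record naming both ends can be connected into a drawable topology, while a record with a missing end cannot.

```python
from collections import defaultdict

# Hypothetical link records as they might arrive in a repository.
# A usable record names the device and port on BOTH ends.
LINKS = [
    {"a": ("router1", "Gi0/1"), "b": ("switch1", "Gi1/0/1")},
    {"a": ("switch1", "Gi1/0/2"), "b": ("server1", "eth0")},
    {"a": ("switch1", "Gi1/0/3"), "b": None},  # far endpoint unknown
]


def build_topology(links):
    """Return (adjacency, incomplete) built from link records.

    adjacency maps each device to a sorted list of
    (local_port, peer_device, peer_port) tuples.
    Records lacking either endpoint cannot be connected, so they are
    collected in `incomplete` for someone to fix at the source.
    """
    adjacency = defaultdict(list)
    incomplete = []
    for link in links:
        a, b = link.get("a"), link.get("b")
        if not a or not b:
            incomplete.append(link)
            continue
        adjacency[a[0]].append((a[1], b[0], b[1]))
        adjacency[b[0]].append((b[1], a[0], a[1]))
    return {dev: sorted(ports) for dev, ports in adjacency.items()}, incomplete


topology, incomplete = build_topology(LINKS)
```

The `topology` dictionary is the kind of port-by-port structure a diagramming or monitoring export could be generated from, and the `incomplete` list shows which records are just populating the database without being usable.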
Now, the second thing is know your enemy. Your enemy is a Root Cause of the problems that you have. You want to perform formal root cause analysis whenever possible. If your helpdesk solution is to always reimage, reboot, or reload, that’s problematic. They should take a moment. They should take just the opportunity for a moment to make sure . . . I’m not saying let the problem go forever, but I’m saying resist the temptation to reboot a server or even an end user workstation until you’ve had the opportunity to collect a little bit of data. That data doesn’t have to be analyzed right there at that moment, but can be put together as a collection of information, trace files, etc. to then do analysis retrospectively. A lot of folks have some of those capabilities, but I’m saying that you need to involve your helpdesk in capturing those sorts of things so that the recurring problems are able to be diagnosed and then obviated.
Hey, the best solution to any problem is not to have it in the first place. Unfortunately, few are motivated with such things. Management needs to know how many trouble tickets you’ve closed and how many problems you’ve solved. Well, wouldn’t it be great to say, “Because we were so proactive, we obviated the problems”? Of course, at that point you lose your justification for the tools, systems, and people. Truly, that should be the objective of any senior management person, to obviate and have all your technologists actually there, but kind of like the Maytag repairman without a lot to do, but when the problem occurs they’re there to help.
The other part of it is the ITIL problem management analysis, going about finding out statistically what your number one problems are. At the Pentagon, when we arrived there, they were getting right about 100,000 SNMP traps a day. Well since then, they’ve started a filtering mechanism. It was so overwhelming, 100,000 a day, that they were not able to respond to the ones that were the most critical and severe. They learned that, “Okay, if you’re going to have 100,000 SNMP traps a day, you’d probably better figure out which ones are your most material, which ones are your biggest ones and how to go about taking a look at that.”
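One simple way to find the most material traps in a flood like that is to group them and rank the groups. The Python sketch below is an assumption-laden illustration, not the filtering mechanism the Pentagon actually adopted: the trap fields (`severity`, `source`, `kind`), the severity ranking, and the `summarize_traps` function are all hypothetical.

```python
from collections import Counter

# Hypothetical severity ranking; a lower number means more urgent.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "info": 3}


def summarize_traps(traps, top_n=3):
    """Group traps by (severity, source, kind) and return the top_n groups.

    Groups are ordered by severity first and then by how often they
    occurred, so a handful of critical traps outranks thousands of
    informational ones. Unknown severities sort last.
    """
    counts = Counter((t["severity"], t["source"], t["kind"]) for t in traps)
    ranked = sorted(
        counts.items(),
        key=lambda item: (
            SEVERITY_RANK.get(item[0][0], len(SEVERITY_RANK)),
            -item[1],
        ),
    )
    return ranked[:top_n]
```

Feeding a day of traps through something like this turns 100,000 individual events into a short ranked list that a person can actually respond to.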
Now, the third thing that I want to talk about in Problem and Change Management is that I go into organizations and I find something that’s amiss and all we have to do is change a few parameters or repurpose a couple of existing systems or capabilities. Invariably, they look at me and they say, “Well, we have no way of promulgating this optimization. We have no way to put this into change control and to have a fix. We either have to have an emergency crisis that’s burning down the building or we have to have a multimillion dollar procurement in order to promulgate and have a set of steps in which we can do change.” My recommendation is to make sure that you have a way to implement free changes that are not procurement or catastrophic kind of response changes. There are many free ways of improving things.
I was at a site a few weeks ago. We went in and their entire Internet access was very slow. We went in and diagnosed the exact cause of the problem. We tested. We implemented, proved that it mitigated. They looked at me and they said, “We have no way of implementing this change to thousands of our users.” They have no way of linking it unless they’ve made a three or four million dollar procurement to any ability to change their infrastructure. Anything that would be cost free has no mechanism by which you can implement and repurpose your systems with just a few process changes. I’d recommend that you think about that. Make sure you can do things that are free.
The other thing is, is there a Master Plan? Now, after the Pentagon disaster, these guys were prepared. They had master plans. They had in-the-event-of plans. They had future plans that mapped the way forward. They took our documents, other things that we learned, and a fresh realization of the threats and vulnerabilities and they integrated that in rapidly. They took that forward. Congress actually allocated a half a billion dollars to beef up the vulnerabilities of the Pentagon. That wasn’t because they, Congress, wanted to spend money. It was because the people who managed the Pentagon network infrastructure had an articulated strategy and they knew where their threats and vulnerabilities were. They had it articulated in a well-articulated plan that allowed them to move that forward and to get champions within Congress and within their leadership in order to accomplish these feats.
If you don’t have a Master Plan, what you’re doing today, your ‘is’ environment, and then you have a disaster, what’s going to happen is you’re probably just going to rebuild in the same way that you were before. That’s, of course, going to be a few generations behind. In the way that we talk about it today, if you’ve seen those insurance commercials, it’s like, “Man, if your car is totaled we’re not going to give you the same 1974 Pinto station wagon. By golly, you’re going to get a 1975 Pinto station wagon in place of your old car.” It’s the same thing here. If you have a well-articulated plan and there’s an emergency, a problem, or an unfortunate situation and you have to rebuild part or your entire infrastructure, if you have this plan, it’s an excellent place to start. It’s a good thing to do.
The next thing is Decision Support Metrics. Well, remember all those systems that were given 100,000 SNMP traps a day? Which ones are important? Which ones aren’t? Well, those were all SNMP traps of yesteryear, the red-green, not working or working. Today, it’s more about performance. What we call Decision Support Metrics is developing a collaborative cross silo management capability validating your end user application capabilities with signatures so that your monitoring systems should be validating any potential changes. Talk about change control. If your monitoring system is monitoring all of the vital signs of your key mission critical applications and you make a change it’s going to tell you, boom, immediately, that you have a problem affecting one of your end user applications or capabilities. That takes a lot of care and feeding.
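The idea of a monitoring system flagging a change immediately can be illustrated with a baseline comparison. This Python sketch assumes a "signature" is simply a set of response times recorded for one end user transaction before a change, which is an interpretation rather than something stated in the talk. The `deviates_from_baseline` function and the three standard deviation threshold are likewise illustrative choices.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline_ms, current_ms, threshold=3.0):
    """Return True if the current mean response time is more than
    `threshold` baseline standard deviations above the baseline mean.

    baseline_ms: response times (ms) sampled before the change (two or more).
    current_ms: response times (ms) sampled after the change (one or more).
    """
    base_mean = mean(baseline_ms)
    base_sd = stdev(baseline_ms)
    return mean(current_ms) > base_mean + threshold * base_sd


# Example: samples for one transaction before and after a change.
before = [100, 105, 95, 102, 98]
after = [150, 160, 155]
change_broke_something = deviates_from_baseline(before, after)
```

A real deployment would need one such baseline per key transaction and regular re-baselining, which is part of the care and feeding the talk mentions.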
The secondary thing about good Decision Support Metrics is that if you know right where your users are slow and things that you obviously can’t optimize without a project… You have a whole list of projects that are out there that you would like to spend on. If you spent in every area: storage, network, and every part of your infrastructure- your budget would be depleted rapidly. In today’s day and age, you use Decision Support Metrics to identify your bottlenecks so that your portfolio spending can be pin point accurate. You can take what you learn from those metrics and you can apply them to your budget allocation, your resource allocation and you know that you are going to get a return on investment that is worthy of the money that you’re spending.
Today’s CIOs have an incredible problem. They’re asked to upgrade in a forklift fashion across the board. It’s cost prohibitive. You can’t do it. Even more so today, we need to have Decision Support Metrics that are helping us prioritize exactly where we’re going to spend the money and quantify the return on investment by identifying end user signatures, by identifying the dependencies and the process flow of the application, especially when you’re starting to talk about cloud. That gives me a whole other thing to talk about and we’ll do that another day. Now we have dependencies and the vulnerabilities of moving to cloud, what about cloud, and then cloud to cloud. How are you going to analyze those problems? We’ll talk about that another day.
Okay, so performance indicators. You’ve got NetFlow to go back and forth for rate and flow information to find out if certain applications are working so you have network flows. You have device status. That’s your SNMP. What’s going on with your components? Do you know that your executives and your business leaders do not care about the status of a router, a switch, or something? What do they care about? They care about the capabilities and the abilities of their people.
In the war environment, commanders don’t care that this link is down or that link is down. What they want to know is, “Can we carry on certain command functions?” We have to translate our performance indicators from what we’re comfortable with, which are router ports and switch ports and links, into what our executives and our end users care about. Where am I impacted? Who is impacted? What applications and what capabilities do I have or not have to run my business or my command on?
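Translating device status into the language of capabilities comes down to a dependency lookup. The Python sketch below is a minimal illustration under stated assumptions: the capability names, the component names, and the `impacted_capabilities` function are invented for the example, and a real mapping would come from the cross-silo documentation described earlier.

```python
# Hypothetical mapping from business capabilities to the components they need.
DEPENDENCIES = {
    "email": {"core-router", "mail-server", "firewall"},
    "order entry": {"core-router", "app-server", "db-server"},
    "video conferencing": {"wan-link-2", "core-router"},
}


def impacted_capabilities(dependencies, down_components):
    """Return the sorted capabilities that depend on any component that is down."""
    down = set(down_components)
    return sorted(name for name, needs in dependencies.items() if needs & down)


# Example: a single database server outage, reported in business terms.
report = impacted_capabilities(DEPENDENCIES, ["db-server"])
```

Instead of telling an executive that `db-server` is red, the report says that order entry is impacted, which answers the "What capabilities do I have or not have?" question directly.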
Knowing what that path is between the client and the server, and the response time at the server, and then the latency going back and forth, and the ability to capture those transactions across your network, is very important.
In talking about multi-points and multi-tier I’d like to just mention that when you have good Decision Support Metrics, you can see whether the process, in other words the CPU, of your web tier, your app tier, your SQL tier or mainframe tier, or your security authentication is consuming most of your response time. Having that multipoint, multi-tier transaction analysis gives you the ability to know exactly what is slowing down your end user transaction. That allows you to spend in the right areas, diagnose in the right areas, and serve your end users well.
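Once per-tier timings for a transaction have been captured, finding the dominant tier is simple arithmetic. The Python sketch below assumes those timings are already available as milliseconds per tier. The tier names, the sample numbers, and the `tier_breakdown` function are hypothetical and only illustrate the breakdown.

```python
def tier_breakdown(tier_ms):
    """Return (tier, milliseconds, percent_of_total) tuples, slowest tier first.

    tier_ms maps a tier name to the time that tier contributed to one
    end user transaction.
    """
    total = sum(tier_ms.values())
    if total <= 0:
        raise ValueError("total response time must be positive")
    ranked = sorted(tier_ms.items(), key=lambda item: item[1], reverse=True)
    return [(tier, ms, round(100 * ms / total, 1)) for tier, ms in ranked]


# Example: made-up timings for a single one-second transaction.
timings = {"network": 40, "web": 60, "app": 100, "sql": 700, "auth": 100}
breakdown = tier_breakdown(timings)
```

In this made-up example the SQL tier accounts for most of the response time, so upgrading the network would do little for the end user.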
Now, back to the Technical Systemization, you can go on our website. We’ve got this model on our website of our Technical Systemization. There are thousands of things that you need to do. There are hundreds of procedures and processes, but we believe in these six things. If you do these and you get them built into the culture of your organization, people can remember them and they can focus on them. It’s not too big. Saying, “We have to implement all of ITIL,” is overbearing. “All the ISO standards”- overbearing. When you say, “Hey, our organization is going to implement these six fundamental things,” and they become the pillars of your ability to operate… These are the things that we believe are important.
Just a little bit about the company. Some of you know me and have worked with me before. Those who don’t, I look forward to that potential opportunity. We are dedicated to Crisis Avoidance. We want to help you eliminate the problem through the integration of Best Practices in your organization. If you have a problem, our Assessment Services are here to provide you immediate response, Technical Services to help you resolve problems, mitigate issues, and then to perform assessments to find out where you might be vulnerable. If you’re in a proactive state, if you’re looking at spending new money on infrastructure or Decision Support Metrics, we can come in and help you assess and help you build those things out. Our Cornerstone Services will come in and help you with an embedded analyst so that they can help you promulgate these types of virtual teams into those various areas and improve and benefit your entire infrastructure.
In summary, documentation, Documentation, DOCUMENTATION. Maintain a highly skilled technical staff. You may have a lot of vacancies. The prevailing thought is that those vacancies are more valuable than the average employee. In other words, it’s better to have a vacancy than it is someone who is average. If you have average people, you need to build them into highly skilled technical people so that they can materially… Part of that is the CIO’s job to make sure that they are empowered. Have that Master Plan, employ problem management and change control dynamics, Decision Support Metrics that drive your portfolio spending, and Business Technology Integration to make sure you have a collaborative environment.
Then those virtual teams that help you. Virtual teams are implemented without any reorganization. If your organization is ineffective, you apply virtual teams to address certain problems and pick the right people from each one of those silos and you can change your organization and improve things without even having to reorganize. Then finally, be an end user advocate. Make those end users your priority.
I’m Bill Alderson. It’s been nice meeting with you today. I respect your time and energy. I appreciate your coming along today. If you have any questions or comments please email me at [email protected]
I appreciate you stopping by today and we look forward to talking to you in the future.
I’m out, Bill Alderson.