Enterprise Architecture IT Disaster Recovery Plan: Steps to APM Under Pressure

This session draws ideas from many critical problem resolution IT consulting engagements and condenses them into 30 minutes on how to avoid an IT crisis. The presentation identifies IT best practices that can be used when creating an IT Disaster Recovery Plan, particularly when the disaster affects application performance. Your staff will appreciate the references on how to troubleshoot network problems during an IT disaster.

Enterprise Architecture here means a multi-site, multi-vendor network with multiple computer operating systems, integrated by the benefiting institution, supporting multi-tier mission-critical applications, and having its own monitoring and support organizations. Compare your internal processes to this set of best practices known to obviate crises, for inclusion in an IT Disaster Recovery Plan.

Transcript for this Video

Good afternoon. This is Bill Alderson. I appreciate you joining us this afternoon for just about 30 minutes or so, as we talk about six things that you can do. In many, many years of performing critical problem resolution, going in when nobody else was able to solve it, in high-stakes, high-complexity situations with lots of people waiting for you to fix the problem, my mind always goes to the end of the engagement when everybody stands around and says, “Oh, that was a pretty simple solution.” Typically, there are multiple things that go wrong in a crisis or a critical problem. When you go back and reconstruct, or do the anatomy of, a critical problem, you always find that the root cause is some technical problem, typically configuration related or something of that nature.

After we’re done doing the critical problem resolution, everybody says, “Well, why did that occur? How did that problem get us to this debilitating point? How can we prevent it in the future?” That’s been my guiding light to help people, yes, go in and help them when they do have a problem. We do that. If I can obviate problems through best practices, the things that you can do to avoid a problem, that’s much better than having a problem that you have to resolve. I think we’d all agree.

Today I’m going to talk a little bit about six of the things that you can do. Of course, there are hundreds of things that we can do, literally hundreds. There are best practices, ITIL, standards of various types, certifications of various types, and all sorts of things that we can do, but we can’t remember all of them all the time, every single day. I’m going to bring to you six of the things that you can do to avoid crisis, to avoid critical problems, and to avoid problems in general. That’s what we’re going to talk about.

This is our website here on the top left. You’ll see we’ve got a variety of services. We’ve taken the Jefferson Memorial and used its pillars to illustrate our services and its steps to illustrate other services. If you have a chance some time, you ought to hit that “Start Here” button on the front of our Apalytics web page and go through and just listen to see how the web page works, and to look at our collaboration systemization model. It’s pretty cool.

Then over on the right we’ve got CIO Tech and we’ve got SynSynAck.com. Syn Syn Ack, of course, is a nod to the TCP three-way handshake (SYN, SYN-ACK, ACK), and there we talk about performance analysis and that sort of thing. Then at CIO Tech, we talk about things that are of interest to CIOs. That’s a little bit about us.

A little bit about me: I started as a communications engineer at Lockheed Missiles and Space Company. If you’ve read Steve Jobs’ book, not his book, but the biography by Walter Isaacson that just came out, I highly recommend it because it does a good job of explaining how Steve Jobs, his company, his problems, and the adversities he faced drove him to build some really excellent products that the market loved. In that book, Isaacson talks about how Lockheed, Fairchild, and the other companies in Silicon Valley, the defense money, and the defense contracting actually spawned the high-tech industry and chip building. Of course, I can remember when we did multi-layer boards over there and wire wrapping. That was the beginning of printed circuits. Then they just kept getting smaller and smaller.

I joined Network General in the mid ’80s, and we started working on the Sniffer. That’s when I really got into protocol analysis and critical problem resolution; it was in ’86, ’87. Subsequent to that, I started Pine Mountain Group. Some of you may know me from that. I created the Sniffer Training program and licensed it to Network General, and they trained thousands of people. I also created the Certified NetAnalyst program and trained about 50,000 people in 22 countries, of whom we certified over 3,000 as certified network forensic professionals. During that time, and over my career, I’ve been able to provide service or training to 75% of the Fortune 100. I’ve been around a lot of different environments.

I sold Pine Mountain Group to NetQoS, a performance management company, in 2005, where I was one of their technology officers. Then NetQoS was sold to CA Technologies in 2009. I hung out for just a little longer than the founders did. Then I decided that I had another company in me, so I started Apalytics Corporation to help people solve problems and learn from those problems to provide and implement best practices so that we could obviate those problems.

Here we are. Six things we can do. Six things we can remember. I put this under technical systemization. Technical systemization involves things going all the way back to the IETF. The first RFCs were not about how technology was going to work but how we were going to collaborate to build the most important technical resource today that we have, which is the internet, interconnecting all of us and providing a platform for business and communities to develop across an electronic system. Technical systemization is how you’re going to collaborate. How are you going to do all of these various things? I’ve taken what I believe to be the things that should be utmost on your list of things to do to avoid crisis. These are our suite of IT best practices.

It doesn’t really matter which order they’re in, but first up is Decision Support Metrics. People are out there buying millions, probably billions, of dollars’ worth of network management tools, analyzers, all sorts of application agents and gizmos and gadgets aplenty, basically to try to figure out what’s wrong or to be notified when things are going wrong. I am finding that even though we’ve spent an inordinate amount of money and time and energy, few people are really using the tools.

The proclaimed experts today are the people who install it and keep it updated. Well, that is a very important part. You have to install the equipment, keep it updated to current versions, and then you have to have users who can go in and connect and click and get the reports. But I’m finding that there are very few people who are truly doing analysis on these very expensive tools. Few are really looking at their enterprise, their environment, their applications and the signatures of those applications, the dependencies and that sort of thing.

I call this Decision Support Metrics because of all the metrics produced by operating systems and monitoring tools, whether it be SNMP, NetFlow, response time and performance management, agent tools inside the actual platforms, cloud-based statistics, and so on. If you’re not really looking at what your dependencies are and tuning those systems to look at your specific applications and your dependencies, and integrating that with problem management (which I’ll talk about a little later) to find out which problems are causing you the most issues, you’re leaving value on the table. It’s very important to take those systems and turn them into Decision Support Metrics.

Decision Support Metrics means that there’s actionable information in those statistics. You can tune your pages and your systems to go out and say, “Okay, I’ve got this application that we really need to serve the application owners here. We need to really monitor this application. When people have problems with it or there are capacity issues, we have the statistics and a baseline, and we’re watching each one of our major applications so that we know what the next step is going to be.” It’s very important to do that.
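As a rough illustration of that baselining idea (a minimal sketch, not any particular vendor's product), the Python below compares an application's current response time against a rolling baseline and flags deviations. The sample data, window, and 50% threshold are assumptions for illustration only.

```python
# Minimal sketch of turning raw monitoring data into a Decision Support Metric:
# compare an application's current response time to a rolling baseline and
# flag it when it drifts beyond an agreed threshold. The sample data and the
# 50% threshold are illustrative assumptions, not vendor defaults.
from statistics import mean

def baseline_alert(samples_ms, current_ms, max_deviation=0.5):
    """Return an alert string if current_ms exceeds the baseline by max_deviation."""
    baseline = mean(samples_ms)
    if current_ms > baseline * (1 + max_deviation):
        return (f"ALERT: response {current_ms:.0f} ms is "
                f"{(current_ms / baseline - 1) * 100:.0f}% over baseline "
                f"{baseline:.0f} ms")
    return "OK"

# Example: last week's hourly averages for a mission-critical app (made-up numbers).
history = [180, 190, 175, 185, 200, 195, 182]
print(baseline_alert(history, current_ms=310))   # -> ALERT ...
print(baseline_alert(history, current_ms=195))   # -> OK
```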

Here are some of the things that go with Decision Support Metrics. First of all, you’ve got the desktop, then you’ve got the server, and you’ve got all of the various components behind the server: the applications, the virtual platforms or the cloud platforms, the databases, the multi-tier capabilities. You have to watch all of that. You really need to architect around those sorts of things. Look at those application data flows and dependencies.

In order to be able to really monitor well, you have to know where your test points are. Back when I was at Lockheed, if there was a problem with a printed circuit card, a multilayer board, we had test points on the board. You could go to those test points and perform a particular measurement to see if the health of that system was up to par. You could measure it, adjust it and that sort of thing, but you had to go to those test points. Well, our large mission-critical networks have test points in them, if we take and document them and lay them out schematically so that you can see those test points. That’s how you know how to build your Decision Support Metrics systems: you have good documentation and you can see it.
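To make the test-point analogy concrete, here is a minimal, hypothetical sketch: a script that walks a documented list of test points (host and port pairs you would pull from your own schematic) and measures TCP connect latency at each one. The hostnames and ports are placeholders, not real infrastructure.

```python
# Hypothetical test-point probe: measure TCP connect latency at each
# documented test point (host/port pairs taken from your schematic).
# The hosts and ports below are placeholders.
import socket
import time

TEST_POINTS = [
    ("app-tier.example.internal", 8080),   # web/app tier
    ("db-tier.example.internal", 1433),    # database tier
    ("san-gw.example.internal", 3260),     # iSCSI gateway
]

def probe(host, port, timeout=2.0):
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None  # unreachable: a finding in itself

for host, port in TEST_POINTS:
    latency = probe(host, port)
    status = f"{latency:.1f} ms" if latency is not None else "UNREACHABLE"
    print(f"{host}:{port:<6} {status}")
```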

It’s difficult to do this today because there’s so much compartmentalization. It’s very difficult for the average technologist, who’s in a silo, to actually go outside and look from desktop to server and all points in between at that application. There’s been some slowness to train people who understand the capabilities of the desktop, the servers, the network infrastructure, and the SAN infrastructure, so that when there are problems pacing or gating their system’s performance, they can go set up test points, or use test points that were already designed in, to expertly monitor those various dependencies. That includes mission-critical applications, your Exchange servers, your mail systems, your internet. We could basically take a lot of our reactive people and repurpose them to look at monitoring tools and monitoring systems, which would eventually obviate problems, reduce the reactive workload, and improve performance and reliability.

There’s a problem with documentation. There’s a problem with good Decision Support Metrics. It’s a holistic problem that doesn’t change overnight. It’s a cultural issue that starts with best practices and an intention to improve as you go through. All of that’s designed to help you recoup your monitoring tool investments. You’ve got a lot invested in those monitoring tools.

You always want to focus on that user performance and go in and do root cause diagnosis. That all ties together with another topic that I’m going to talk about, and that is Architecture Ownership. A lot of my customers are government contractors and large institutions that utilize outsourcing organizations, or perhaps you are an outsourcing organization, or perhaps you’re a government entity and you have a lot of contractors and those contractors are changing. One of the things that I like to encourage you to do is to make sure that, regardless of what’s going on with your contracts and that sort of thing, you maintain Architecture Ownership. That requires accurate technical documentation, and I’m going to address that.

I’m going to talk to you about that a little bit so that when you change contractors or a contract, you have a different outsourcing organization, you’re looking to outsource, or maybe you’re looking to insource and change from outsourcing back to your own insourcing, you make sure that you aren’t limited by indispensable contractors or indispensable employees, because you own your architecture. You own your system. Now that doesn’t mean you can’t have a contractor perform your Architecture Ownership functions, but it probably needs to be separate from your regular operations so that it’s more of an auditing function.

In Architecture Ownership, what do we mean by that? There are two components of it. One is to know yourself. That is, to basically render your architecture into diagrams so that you can see your environment, all your technologists can see your environment and understand it, and your architects can collaborate around pieces of paper.

Today, with our lights-out data centers and that sort of thing, it’s impossible to do this unless you have good documentation. Not only good documentation of the schematics and the configuration, but also of your racks and that type of thing, because you don’t have, and you don’t want to have, people in your data centers. You really want all that work to be done online, virtually, or from multiple locations. In order to accomplish that, it’s essential to have good system documentation.

What we believe you need to do is pull all the router, switch, and server configs out and basically put those on paper in a logical, schematic-type diagram. That documentation needs to span multiple security zones and be scalable. You really want a diagram that’s not the server guy’s diagram, the network guy’s diagram, the storage area network guy’s diagram, or the desktop people’s diagram. Whose is it? It’s not the application diagram either. It is your organization’s diagram, cross silo. It’s an end-to-end diagram of all the pertinent detailed information, so you can put your finger on where an end user who’s getting poor performance is connected, and move your finger through the virtual circuits all the way, so that you can find both the logical and the physical path.
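As one way to start that end-to-end diagram, here is a hedged sketch: it reads a simple, assumed inventory of device links (the kind of data you might extract from interface descriptions or CDP/LLDP neighbor tables) and writes a Graphviz DOT file so the cross-silo topology can be rendered and reviewed. The inventory format and device names are illustrative assumptions.

```python
# Sketch: turn a simple link inventory (e.g., extracted from interface
# descriptions or CDP/LLDP neighbor data) into a Graphviz DOT file that
# renders an end-to-end, cross-silo diagram. The inventory below is an
# illustrative assumption, not a real network.
links = [
    ("user-vlan-10", "access-sw-01", "Gi1/0/24"),
    ("access-sw-01", "core-sw-01",   "Te1/1/1"),
    ("core-sw-01",   "fw-pair-01",   "Te1/1/2"),
    ("fw-pair-01",   "lb-pair-01",   "outside"),
    ("lb-pair-01",   "web-srv-01",   "vip-443"),
    ("web-srv-01",   "db-srv-01",    "tcp-1433"),
]

with open("topology.dot", "w") as f:
    f.write("graph enterprise {\n  rankdir=LR;\n")
    for a, b, label in links:
        f.write(f'  "{a}" -- "{b}" [label="{label}"];\n')
    f.write("}\n")

# Render with Graphviz: dot -Tpdf topology.dot -o topology.pdf
```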

If you have a bad switch port somewhere, of course you’ve got redundant systems, right? Redundant systems can sometimes be a problem because you may have a bad fiber on one of those redundant connections, but it keeps flipping back and forth. Therefore, you have an intermittency. It does work, but when it works, it works in a degraded state. Then it flips back and forth between the two. It’s important to be able to view your system laid out so you can see everything.

Another part of Architecture Ownership is this: I have noticed that there are sometimes literally hundreds of people in the IT organization who, if you really quiz them and ask them, don’t have a clue where they fit in the organization technically. It might be a server person, a network person, a desktop person, a virtualization person, the person who takes care of the SAN or the WAN or what have you. Those people need to understand where they fit in. Once you have a really good document that depicts your environment, then you can take that document and train all of the people: “Here’s our architecture. This is why we do it this way. Here’s how it works. Here are the tools and systems that we use to manage it.” Then you bring out all of these people, who previously might have been hiding under their desks when there was a problem, because you’ve enabled them to understand their environment. You have your network rendered so that you can see it, and now you train people.

What we believe you need to do is train people on your architecture, your systems, your monitoring tools, your capabilities, your organization, and your trouble ticket system, so that all your technologists are firing on all eight cylinders. Instead of having five or six people in your entire organization who are really into it and understand everything, you bring it up to several hundred through training. What we believe you should do is have a certification program for your architecture.

First you document it. Then you train everyone on it, and you test them and certify that they truly understand it before you give them the keys to your Active Directory, before you give them the passwords for the routers and switches and platforms. They go through this orientation and certification on your particular architecture so that they are not trying to make it look like wherever they came from, but rather bearing a hand to help you continue to build your architecture and your system. Architecture Ownership is a very important component.

Problem and Change Management, of course, is essential to, one, diagnose root cause, and also to enable rapid, continuous system performance optimization. Doing root cause analysis, of course, is my area of expertise and where I get called in a great deal. We come in, we diagnose the problem, and then we do root cause analysis. Usually it’s had to escalate quite a bit before people finally say, “Hey, let’s get somebody in here to help us solve this problem. We have smart people. We have capable people, but they’re busy doing other things. Let’s bring somebody in to focus on this.” We come in and we help with that solution.

Well, one of the things that we do when we do health checks is check out your trouble tickets and find out what’s been pervasively causing problems. If we can prioritize those bad trouble tickets, we can then go mitigate the root cause of those problems and basically obviate future trouble tickets, which lowers the cost of your help desk and that sort of thing. Some organizations are so reactive that they just keep hiring more and more help desk people to be customer interface folks, and that keeps growing so large and so wide that you can see the curvature of the earth in the cubicle areas where the help desk sits. It makes a lot of sense to mitigate those particular problems through statistical analysis and then root cause analysis.
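A simple sketch of the statistical side, assuming your ticket system can export a CSV with a root-cause or category column (the filename and column name are assumptions): count tickets by category and list the few categories that account for most of the volume, Pareto style.

```python
# Sketch of a Pareto-style pass over trouble tickets: count tickets per
# category and show which few categories drive most of the volume.
# Assumes a CSV export with a "category" column; the filename and column
# name are illustrative, so adjust them to your ticket system's export.
import csv
from collections import Counter

counts = Counter()
with open("tickets.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["category"].strip().lower()] += 1

total = sum(counts.values())
running = 0
print(f"{'category':30} {'tickets':>8} {'cum %':>7}")
for category, n in counts.most_common():
    running += n
    print(f"{category:30} {n:8d} {100 * running / total:6.1f}%")
```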

The first thing that you need to do is write a problem statement. You need to know what your problems are. You need to reverse engineer your system documentation. Perform micro and macro analysis. Macro analysis is using your decision support tools to find out roughly where the problem is. Micro analysis is getting down to the packet level to diagnose the problem. That’s another thing that you need in order to get to critical problem resolution and have a really good organization: you not only have to have the macro analysis, the SNMP tools, NetFlow tools, and performance management tools, but you also need the ability to go down to the micro level with packet-level analysis.
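As a minimal example of what micro analysis can look like at the packet level, here is a hedged sketch using the Scapy library to read a capture file and report TCP connection-setup (SYN to SYN/ACK) latency per server. The capture filename is a placeholder, and this is only one of many measurements a packet-level analyst would make.

```python
# Micro-level sketch: read a packet capture and report TCP connection-setup
# latency (client SYN to server SYN/ACK) per server. Requires the Scapy
# library; "capture.pcap" is a placeholder filename.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP

SYN, ACK = 0x02, 0x10
pending = {}                      # (client, server, sport) -> SYN timestamp
setup_times = defaultdict(list)   # server IP -> [latency in ms]

for pkt in rdpcap("capture.pcap"):
    if IP not in pkt or TCP not in pkt:
        continue
    flags = int(pkt[TCP].flags)
    if flags & SYN and not flags & ACK:          # client SYN
        pending[(pkt[IP].src, pkt[IP].dst, pkt[TCP].sport)] = float(pkt.time)
    elif flags & SYN and flags & ACK:            # server SYN/ACK
        key = (pkt[IP].dst, pkt[IP].src, pkt[TCP].dport)
        if key in pending:
            setup_times[pkt[IP].src].append(
                (float(pkt.time) - pending.pop(key)) * 1000.0)

for server, samples in setup_times.items():
    print(f"{server}: {len(samples)} handshakes, "
          f"avg {sum(samples) / len(samples):.1f} ms")
```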

Another thing that is helpful, if you have a lot of applications and you find that you’re frequently having to go out and diagnose problems because your applications are said to be slow: embed into your application development an early and frequent deep-packet performance analysis, and then design the monitoring system to look at those application vital signs. That will yield incredible ROI for you.

It’s our recommendation that you have regularly embedded deep packet inspection early in your development cycles and redevelopment or refresh cycles. The alternative is waiting and putting an application out there, spending $20 million on a new hardware refresh, software refresh, and that sort of thing, and when you go to roll it out, it’s got so many SOA interfaces with other systems that when you turn it all on and start using it, you find out that the end users won’t use it because it’s so abysmally slow, because no performance analysis was done in a micro fashion, incrementally, over that period of time. You need to look at doing application analysis early in your cycle.

Here are some of the things that you can do. Identify your dependencies, do micro and macro analysis, and do latency analysis on your web tier, your SAN, your cloud, your middleware, your load balancers, and your firewalls. Find out where your problems are early on. That will help you with those Decision Support Metrics, to be able to design metrics that watch the areas you know might become a problem in the future as you scale.
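As one small, hypothetical example of latency analysis at the web tier, this sketch breaks a single HTTPS request into stages (DNS lookup, TCP connect, TLS handshake, time to first byte) so you can see which stage, and therefore which part of the path, is adding delay. The target host is a placeholder, and only the standard library is used.

```python
# Sketch: break one HTTPS request into stages (DNS, TCP connect, TLS
# handshake, time to first byte) to see where latency accumulates.
# The host below is a placeholder.
import socket, ssl, time

host, port, path = "www.example.com", 443, "/"

t0 = time.perf_counter()
ip = socket.gethostbyname(host)          # DNS lookup (IPv4 for simplicity)
t1 = time.perf_counter()
sock = socket.create_connection((ip, port), timeout=5)
t2 = time.perf_counter()
tls = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
t3 = time.perf_counter()
tls.sendall(f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
tls.recv(1)                              # wait for the first response byte
t4 = time.perf_counter()
tls.close()

for stage, start, end in [("DNS", t0, t1), ("TCP connect", t1, t2),
                          ("TLS handshake", t2, t3), ("First byte", t3, t4)]:
    print(f"{stage:14} {(end - start) * 1000:7.1f} ms")
```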

Then there’s Business Technology Integration. Like I said, a lot of technologists and technology managers are finding that they’re in a silo; they’ve got operations and engineering for each different type of technology, and it’s hard to get an end-to-end view of the entire system. We’ve come up with some ways we believe you can build virtual teams that have a representative of each silo for your Architecture Ownership, for your Problem Management, for each one of those functions. If you put them on a team together, you end up with a collaborative design environment where all your silos are working together. You can have documentation that is end to end, from the client all the way to the server, the application, and all the various systems in between. You can really get some traction there. That is dependent upon a Master Plan.

A Master Plan is a prerequisite for collaborative design. You have to know what your objectives are. We like to focus you, under Business Technology Integration and Master Plan Development, on documenting the “is” and then migrating toward that future.

This is our cross-silo collaboration model. It is essentially a way in which you visualize all of your various organizations and all of your various silos: cloud, network, desktop, security, platform, all these different groups. You basically build out an Architecture Ownership group, a Business Technology and Master Plan Development group, and an Application Development Optimization group. These are virtual teams. You don’t have to change your logical organization or your budgeting silos. You just need to put this together so that you have the glue to help everybody understand one another’s environment in a team sort of way. Wrap all of that up into your “is” Master Plan. Then you’ve got your “way ahead,” your “future” Master Plan, so that you know where it is that you’re headed.

These are the things that you can do: Architecture Ownership, Decision Support Metrics, Business Technology Integration, Master Plan Development, Cross-Silo Optimization, Problem and Change Management, and Application Development Optimization. Really good things to keep at your forefront. Just remember, take a look at that Jefferson Memorial and remember that each one of those pillars and each one of those steps is a step toward your users’ best interest.

If along the way Apalytics can help you: Apalytics is dedicated to crisis avoidance through the integration of best practices. Our TS, Technical Services, is where we’ll come in and help you with existing problems. Our AS, Assessment Services, is where we can come in and help you identify the risks and then help you develop mitigation strategies. Then, finally, our cornerstone services help obviate problems through the expert integration of best practices.

I appreciate you joining me today. We appreciate you spending your time with us. I hope it’s been useful to you. If there’s anything that I can do to help you, or if you have any questions about this presentation, just go ahead and email me at [email protected]. I appreciate your help in building out our industry, helping us all grow and understand what we should be doing. Apalytics: Application Analytics Software and Services. We’re at your service. We appreciate you joining us today. Take care.