OSFF Keynote Insights: FSI Cloud Native Incident Response Readiness Tabletop Exercise

Written by Win Morgan | 8/21/24 12:52 PM

At our Open Source in Finance Forum in London, industry leaders gathered to explore the future of cloud native incident response in the first FINOS cloud native incident response tabletop exercise (TTX). Watch this recording for an overview of the findings from Francesco Beltramini and Ash Ward from ControlPlane and Laura Penhallow from Quadrature.

Are you a security professional at an FSI organisation? Get involved in our next tabletop exercise at OSFF NYC 2024.


Join experts from across financial services, technology, and open source to deepen collaboration and drive innovation across the industry in order to deliver better code faster at Open Source in Finance Forum.

TRANSCRIPT

Ash Ward: [00:00:00] So yes, we're really proud to have been able to run the tabletop exercise. For those of you who aren't familiar, that's where we take a real scenario that's happened and pose it to the people who have attended. As you can see, we've had CSOs, heads of InfoSec, pretty senior and important people, of which I am not one, and they've described how they would respond to this scenario that took place.

We do it under the Chatham House Rule, behind closed doors. But that then leads to a problem: how do we share some of the learnings and the information with all of you? Fortunately, I was not at the tabletop exercise. We're going to look at the lessons that were learned, and then I'm going to ask Francesco and Laura, who were there, to sum up what the learnings were so that we can get a little bit more detail from them. Laura's not going to be talking about anything specifically to do with her or her organization, but instead just recapping some of those lessons that we learned so that we can all get takeaways.

So the first takeaway [00:01:00] then was that there's a generally lower confidence in cloud native incident response readiness compared to traditional on-prem. Laura, I'm going to kick off with you, if that's alright. What's the difference? Why would I care that it's cloud native as opposed to on-prem?

Laura Penhallow: It's new.

In two words, it's new. Traditional security tools for incident response, they don't work. That means that we have to shift from tradsec to cloudsec. And there's a learning curve, and then there are lots of firms that are adopting cloud native technologies that still have traditional security needs. So do you have two teams?

Do you upskill one team? Do you grow it? So there are a lot of decisions that you have to make tactically and strategically to make sure that you have adequate coverage, but I think that overall everybody's feeling was a little bit, "We've got some work to do still."

Ash Ward: And Francesco, would you say that was the same thing? Without giving away too many details, was that a theme across a large number of the people who were there?

Francesco Beltramini: Yes, indeed. So it was identified [00:02:00] that these technologies tend to be more fast-paced. And more importantly, they have different operating models, with a responsibility line between what you can do as a consumer and what the cloud provider, for example, makes available to you.

So it's actually hard to keep up from an operational and security standpoint. The risk we have is that detection and response capability may not be at the right maturity level or correctly implemented. Overall, it's critical to keep the strategic technology adoption roadmap of your organization in sync with these very detection and response capabilities, to avoid gaps and, in general, achieve a higher confidence in your ability to respond to threats in such environments.

Ash Ward: That's good, thank you.

And for everyone's benefit, assuming that we keep to time appropriately, we will throw it open for Q&A at the end, so we can dive into a little bit more detail. If I can make the clicker work. So the second takeaway then: know [00:03:00] normal baseline behavior, regardless of the technology.

What does that mean? Why do I care about Sona Bank?

Laura Penhallow: We have to know what's normal to be able to know what's weird. So if you don't actually know what a day in your organization looks like, or if it's normal for Ash to log in at two in the morning (maybe you're a night owl, maybe you're not), you can't figure out what's weird, call it an incident, and start investigating something, because you don't know what's normal.

And that goes across tech and business related processes.
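Illustrative sketch (not part of the talk): the "know what's normal to know what's weird" idea can be shown with a very small baseline of login hours. Everything here is an assumption made for illustration: the function names, the 5 percent threshold, and the sample data are hypothetical, and a real baseline would draw on far richer signals than the hour of day.

```python
from collections import Counter

def build_baseline(login_events):
    """Count how often each user has logged in at each hour of the day (0-23)."""
    baseline = {}
    for user, hour in login_events:
        baseline.setdefault(user, Counter())[hour] += 1
    return baseline

def is_unusual(baseline, user, hour, min_share=0.05):
    """Return True when `hour` makes up less than `min_share` of the user's past logins.

    `min_share` is an arbitrary illustrative threshold, not a recommended value.
    """
    history = baseline.get(user)
    if not history:
        # No baseline at all: nothing can be called "normal" for this user.
        return True
    total = sum(history.values())
    return history[hour] / total < min_share

# Hypothetical history of (user, hour of day) pairs.
events = (
    [("ash", 9)] * 40 + [("ash", 10)] * 35 + [("ash", 2)] * 1
    + [("laura", 2)] * 20 + [("laura", 3)] * 15
)
baseline = build_baseline(events)

print(is_unusual(baseline, "ash", 2))    # a 2 a.m. login is rare for this user
print(is_unusual(baseline, "laura", 2))  # the same hour is routine for this one
```

The same login time is flagged for one user and not for the other, which is the point being made: the event alone says little until it is compared with what is normal for that person or system.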

Ash Ward: Of course I do, if my boss is in the room, then yes, I regularly am working at 2 in the morning. Who doesn't? Francesco, do you have anything to add to that one?

Francesco Beltramini: So in general, it was identified that technology is actually a means to enable a business mission, right?

However, humans are still very important and still very much in the picture. Technology, and the automation associated with it, can for sure introduce a higher degree of determinism, which is easier to profile and actually allows you to focus on [00:04:00] the human behavior. And this is not really to act as an internal police, but to detect and understand anomalies in the context of their being an indicator of actual malicious activity.

Ash Ward: It must be incredibly difficult to do, with all the information that you'd be getting, to work out what's normal and to normalize behavior. But it's really good to see the takeaways, even if they are difficult things to achieve. And I suppose, then, something that I have to deal with on a day-to-day basis is ensuring the right people are in the right roles. But I'm looking at that for getting the most effectiveness out of my staff. Presumably this is slightly different.

Laura Penhallow: Yeah, it's more than just making sure the right people are doing the right job. It's actually planning ahead of time, so that when you have an incident you have the right response team assembled. And it's not going to be the smartest person in the room or the specialist here.

It's going to be the person with the clearest head, who can just objectively look at things and help make decisions and help separate out the craziness that happens if you have a major incident. So you want those [00:05:00] people to be ones you can grab right away. And then also make sure that the people who are actually investigating the incident can do their jobs, and that the comms can be managed effectively as well.

Ash Ward: I mean, having been involved in major production incidents, as opposed to security incidents, I know firsthand how difficult it can sometimes be sorting out the noise that's coming in. Francesco, you were taking notes, in the main, from everybody who was there. What would you add to this?

Francesco Beltramini: Everyone recognized that a secure organization must be staffed properly at all levels. We've got the engineering teams. We have analysts, incident responders, incident response managers, and of course CISOs. The fine balance that must be achieved is the one between technical skills and soft skills.

As a matter of fact, the most critical role during the incident was identified as being the incident response manager, because this person doesn't need to be particularly technical, right? They don't need to know how to isolate a workload on a Kubernetes cluster, [00:06:00] but what they do need to know is how to keep things calm and how to manage stressful situations. And, more importantly, how to gather all the relevant information to then take very informed decisions to progress the investigation, eventually declare an incident, and eventually take all the follow-up steps.

Ash Ward: Wonderful, thank you. And again, I seem to have failed at using a clicker. There we go. Francesco, I'll pick on you first this time around. Healthy transparency when communicating internally and externally. Just make sure your messages are clear. It seems a bit of a given. I feel there's going to be more to this.

Francesco Beltramini: Yes, somewhat. So, indeed, it works again. "Up to your mouth, like an ice cream," they said. Yeah, besides having the right people in the right positions, it's really key to know who to talk to in the larger organization, and how, and to establish trust. Now, trust is achieved indeed through transparency. But of [00:07:00] course, bear in mind, transparency is not necessarily full disclosure.

We were speaking with security professionals who recognized the need, at times, to choose what information gets to whose ears, to avoid worsening the situation or causing misunderstanding until the context around the incident response was built and fully understood.

Ash Ward: Now the best thing about being here and getting to ask the questions is I get to use my superpower, which is being very stupid.

So, I'm communicating internally to make sure that very important people aren't getting very upset. What am I doing externally? Who am I telling? What's that all about?

Laura Penhallow: You might be telling the regulators if something, God forbid, happened, or if you've got customers, you might need to notify them, but timing's key.

So you don't want to tell customers, "Oh, we've got a problem, but we don't really know what it is," because then they're going to lose confidence in you. In the same way, you don't want to tell the regulator too early. But you can't not tell the regulator. We brought up quite a few examples of other firms, not in the room today, where they waited [00:08:00] too long. There's actually a recent SEC finding for an MSSP in the United States that waited too long: they knew about incidents, they didn't say anything, they didn't disclose. So timing is key. Maintain confidence, but also be transparent: don't try to hide anything, but the right information at the right time is really important.

Ash Ward: Yeah, that is interesting, because it's not then just about managing the incident and the response to the incident. There's also communication that's needed, and regulatory requirements. And so that's coming into the response runbooks: testing and understanding them for technical and business needs, as everyone can read off the slides themselves. Is this as simple as "a disaster recovery plan isn't a DR plan unless you've tested it"?

Laura Penhallow: Yeah, and who updates their runbooks all the time? Mm hmm. I can't raise my hand, honestly. We've got lots of runbooks for lots of things, and keeping them up to date is really hard. So having response runbooks and then testing them, even in small scenarios, is really key. And especially when you make major [00:09:00] technology changes, that's going to save your bacon every time.

And it's not just for technology. It's: how do you respond to the regulators? What's the communication template when we have to send something out to customers? It's having that, so you can almost do it asleep. You shouldn't, but you can just do it. You don't have to think about it. It's already there for you because we've already discussed it, we've already agreed upon it, and so it's ready for you. Then you can focus on what's real, gather the information, and move on.

Ash Ward: Yeah, that makes a lot of sense. I appreciate not doing it in your sleep, but if you've already thought it through. Francesco, have you anything to add to that?

Francesco Beltramini: Yeah, so these runbooks, indeed, need to be accurate and complete, especially for cloud native. But equally important is to rehearse and rehearse them. That is almost more important than obsessively detailing every single step, because these steps will very likely change at the next release of a platform.

Runbooks must be 100 percent aligned with the business needs and the mission, and they must be fully [00:10:00] aware of the impact on the business of each of the steps. That was widely recognized within the program.

Ash Ward: That's really interesting, because one of the things that always gets me is that sometimes, when we're moving fast and working with a lot of different new technologies and ways of working, a runbook is out of date as soon as it's written.

But that's a good takeaway: that may be the case, but you're still running through the runbook to understand where there would be a difference, and that's okay, because we can handle it and we can move on from that.

Yes. Don't assume. And we all know what happens when you assume. You have to challenge your thinking every step of the way and ask the right questions.

So that is a lovely piece. These are genuinely things that came out of the tabletop exercise, that people agreed on, and these were takeaways. But I read that and I think that looks like a bit of marketing, doesn't it? What does this mean then? Francesco, you can go first this time.

Francesco Beltramini: Sure. So yeah, this was the key takeaway. And let me expand on that. You mustn't assume during the response, but [00:11:00] spend time calmly building the context during the incident. Each decision and step must be thoroughly informed by the context you are building. This will actually avoid rushed decisions and an overall worsening of the situation.

And this approach is very important to make sure you don't miss something or actually upset someone as you reach out to the organization and to your bosses when you have to report back to them. So calmly understand the context, then provide the right information when you're sure about things.

Ash Ward: It is an interesting one, that "challenge your thinking each step of the way", and asking the right questions.

What was your takeaway for this?

Laura Penhallow: It seems like it's such common sense, but it's really something that you have to stop and think about. There were more than a couple of people who raised their hands and said, "Actually, I've learned this the hard way," that you can't just go, "Oh, it's this. Let's shut it down.

Let's isolate this," without going around and saying, "You know what? Is this the right thing to do? Or maybe we should speak to other [00:12:00] teams, maybe look at the wider impact, and make a decision based off the risk appetite for the business." And so what you think is right may not actually be right for the business.

So that was a really key point, because we all know what probably is the right thing to do in a vacuum, but it might not be the right thing if you were going to cost a business millions of dollars or pounds, or cause more catastrophe by taking actions without making sure that the wider impact isn't felt.

Ash Ward: Oh, that is interesting. So instead of just going to quickly switch everything off, which could actually cause a huge bit of reputational damage or anything else. Or get rid of evidence that was needed.

Laura Penhallow: Might make it worse.

Ash Ward: Oh, that's really good. That's a nice takeaway. We do, I think, I'm going to look and see if anyone's going to shout at me to say we've not got time, but I'm presuming someone will.

Would anyone like any clarification, or are there any questions that we can answer?

Oh, [00:13:00] this way. The one I will sit on. Alright. Now someone will have to ask a question, otherwise we've gone to all that trouble for no reason.

We've got one over here. Please.

Audience Member: I just wanted to come back and see if you could expand a little bit on that very first point. When we're talking about the new technology, okay, it's different to what we're used to running on-prem. Is it purely just the fear factor, you think, or are there fundamental technical differences that are making it harder, so the fear is justified?

Laura Penhallow: I think that it is different, and it's a different way in how you think of things. How you investigate a Kubernetes or a container incident is much different from how you would investigate a bare metal incident. My first thing might be to log into a command line, start capturing logs, do a tcpdump, or look at processes, but you can't do that in a container or in a cloud native environment.

And then the off-the-shelf commercial [00:14:00] tools haven't quite caught up to where traditional security tools are, just because it's not quite mature. But make no doubt, hackers have totally found a way up. It's a new cat and mouse game, so it's exciting because you get to learn something new, but it's terrifying because you're going in blind all over again.

Ash Ward: It was quite exciting, actually. I spoke to someone today. I don't pitch product, I'm not trying to sell any product. And somebody was talking today about how, with the cloud service providers, you can actually get some deep insight into what's there. In my experience, unless you've gone around and touched each server and know what each server in your data center is doing, you would regularly have to have a walk around to confirm that your CMDB was up to date.

And this chap was talking to me and saying one of the exciting things he finds with the cloud service providers is that instead they're able to get that inventory through them, because they've got it all there. There's nothing sitting spinning under a desk somewhere, or an old Ultra 5 that's running something.

Do you have anything to look, Jamie?

Francesco Beltramini: Yeah, just to add: having managed security [00:15:00] operations teams, there is always this sort of, I won't say disconnect, but gap between what happens elsewhere in the organization and the security ops guys. There's always this catch-up game, because you've got a business mission.

Other teams are really under pressure to build something new, deliver something new. And then they let us know that, oh, this needs to be monitored. So it's about rushing into integrating with your monitoring stack. And sometimes the tech they use is new to the organization and to the incident responders.

That's why I mentioned keeping in sync the adoption of new tech and the responders and their tech stack.

Ash Ward: Someone putting in a DevOps platform in a company one time. That's because you always tell me no. I try and do this fun stuff and you say no. No, I'm joking of course. Are there any other questions?

We've got another minute and a half there. There's one at the back, I think. I can't really see. It's just coming to you.

Audience Member: Thanks. When you talked about looking at the sort of people you need [00:16:00] in your response team, one of the things I came across in an incident I was involved with before was response fatigue.

It was a long incident and therefore we needed to actually have two people effectively for each role. Did that come up at all? Did you consider that at all in your scenario?

Francesco Beltramini: It's more than two people per role because, of course, if you want to maintain a 24/7 response capability, these guys need to sleep sometimes, right?

Tier-one analysts are usually perhaps less skilled, more junior. They look at the dashboards, and there are usually maybe six or seven of them. But then, of course, above them you have a tier two, and then you have the incident response manager. Those, yes, must be redundant.

Audience Member: What about when it gets down to the specialist knowledge? That's where we hit the problem. So yes, the SOC had that sort of structure and the ability to do it, but when you get down to the person that knows this application and knows where the bit is, that's where you [00:17:00] come to those sort of points.

Francesco Beltramini: It's about mapping out your key stakeholders in the org and making sure you have a primary contact number and then a backup contact number to reach out to them.
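Illustrative sketch (not part of the talk): the stakeholder mapping described here could be captured as a small lookup that falls back from a primary to a backup contact. The class names, roles, people, and phone numbers are all hypothetical placeholders, and in practice this mapping would more likely live in an on-call or paging tool than in a script.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Contact:
    name: str
    phone: str
    available: bool = True  # e.g. False when off shift or already exhausted

@dataclass
class Stakeholder:
    role: str          # the knowledge or responsibility this entry covers
    primary: Contact
    backup: Contact

def who_to_call(stakeholders: List[Stakeholder], role: str) -> Optional[Contact]:
    """Return the primary contact for `role`, or the backup if the primary is unavailable."""
    for entry in stakeholders:
        if entry.role == role:
            if entry.primary.available:
                return entry.primary
            if entry.backup.available:
                return entry.backup
            return None  # both unavailable: a gap to fix before an incident
    return None  # role was never mapped: also a gap

# Hypothetical mapping with placeholder names and numbers.
stakeholders = [
    Stakeholder(
        role="payments-app specialist",
        primary=Contact("Primary Person", "+00 0000 000001", available=False),
        backup=Contact("Backup Person", "+00 0000 000002"),
    ),
    Stakeholder(
        role="incident response manager",
        primary=Contact("IR Manager A", "+00 0000 000003"),
        backup=Contact("IR Manager B", "+00 0000 000004"),
    ),
]

contact = who_to_call(stakeholders, "payments-app specialist")
print(contact.name if contact else "no one reachable")
```

A `None` result is the useful signal: it marks a role with no reachable person, which is the specialist-knowledge gap the question from the audience was getting at.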

Ash Ward: Would you say that, when that came up during the tabletop exercise, it was something people were aware of, or would that come out through doing runbooks or rehearsing?

Laura Penhallow: What do you think? I think, yeah, that when you do the runbooks, you want to make sure that it's not just the most senior people who are performing the runbooks. They can probably do it with their eyes closed. So it's making sure that new people, or junior people, can do those runbooks.

So if something happens to that senior person, there's not a gap. And then also, a lot of making sure that people are in the right roles is thinking about what happens if we do have a crisis. So who do I want in my team? Obviously, if it's an application, you need an application specialist.

But who does the comms? Who's going to be the incident manager? Who's going to be the face of this? And just know that ahead of time. "Wow, I really want him, because he seems to be so calm and level-headed. He'll be able to just herd people around and [00:18:00] shuffle people out, let them do their work."

So it's those people, but also definitely in the day to day.

Ash Ward: Yeah, and you want to identify that you don't want to speak to me, because I'll probably have had a gin and tonic by that point and not be any use to anyone. Thank you for the questions, I really appreciate that.

We have, I've got the timer telling me that we're at zero so I think we will wrap it up and say thank you very much everyone. Thank you.