Interview with Michael T. Nygard at GOTO Chicago 2014
Transcript
Hi, it’s Mike with UGtastic. I’m here again at GOTO Conf 2014, and I’m sitting here with Michael Nygard, who gave a talk on “DevOps at 5, where are we now?” Thank you very much for taking the time to speak with me. Can you tell me a little bit about DevOps at 5? What does that mean, and how did you come to that presentation?

Sure, I’d be happy to. So the term DevOps was actually coined by one individual. We know when the name was created. It was created by Patrick Debois in 2009. Okay. So that was five years ago, and five years is an interesting time span. First of all, it’s a round number, and humans like round numbers. But it’s also sort of the time it takes any idea to go from radical to kind of entering the mainstream. It’s about half the time it takes for an idea to become completely mainstream, and it’s about one quarter of a normal career for a working programmer, or what you would call the time from entering the field to being a grizzled graybeard type. Yeah. So five years was an interesting time to sort of take stock, collect up many of the ideas that people have created, and sort of provide a survey for folks who haven’t been in it for that time, but maybe are hearing the term.

Right. In an earlier interview, we talked with somebody who talked about how they were using continuous delivery, working with the restaurant industry, and were able to deliver a change 10 hours before the opening of a major bar. And that these concepts can be applied to serious product development that actually really touches an end user. And what might have previously been considered a, oh my God, we’re in a freeze mode. Don’t touch anything.

Yeah, definitely. When you look at the reason for freezes and the sort of traditional deployment process, a lot of it is around risk management. The idea of saying an event, a negative event, has a high cost. And so we need to avoid that cost. Right. I would turn that around these days and say, preventing that event also has a cost. It’s not as visible in the immediate sense, but it’s a cost that you feel. It’s a cost that you feel in the form of friction and delay in getting things done. Right. And so we’ve changed to a different style of risk management that says, I’m going to try and reduce the chance of a bug slipping through to production. But I’m also going to make sure that I detect it immediately and can push a new release that eliminates the bug immediately.
And so in some sense, the number of opportunities for error goes way up, but the cost of any individual error goes way down. Because you’re limiting the scope.

So is it trying to turn what was previously known as a risk period, like once we get into a certain period before it goes live, we have code freeze, so that change is now not a risk, it’s just part of the way we approach the process?

Well, I’m a little cautious about that because there still is some risk associated with it. And if you don’t adopt any of the other practices and you just go to fast deployments, you’re going to incur some damage. Okay. As you speed up the deployments, you also need to speed up your monitoring and your metrics. You need to make sure that you can detect a problem very quickly. So focus on that mean time to detection. I’ve been in many places where the mean time to detection is measured in weeks. It needs to be seconds to really say you’re doing the DevOps continuous delivery approach. And likewise, you need to be able to repair the damage very quickly. Adrian talked about detecting a bad deployment within five seconds of when it goes live and automatically switching back over to the old system that’s still live and running. So in that case, you say a 10-second error, how many users are going to hit that 10-second error? Right. So again, I’m trying to put it into dollar cost terms, where I’m comparing the expected losses from bugs slipping through to production versus the expected losses of bugs that are already in production that you can’t fix because you’re in a freeze.

Yeah. So to me, it almost sounds like you’re trying to say we’re going to reduce that surface area of that first group that hits a new change. And then once we know, because we’ve kept that surface area small, we can grow it out and then be able to retract much more quickly.

Absolutely. And I think of limiting it in space and time.
So limiting the exposure in space is how broadly is my audience exposed to the new code right away? So we use techniques like feature flags and differential routing to let a few people in at a time. And then limiting the scope in time is how quickly can I fix the bug when it gets through?

Right. So, yeah, the speed of it, or even being able to turn it off. So the way we’d use a feature flag is, oh, that one didn’t work. So that could be that instantaneous mitigation of risk. So the term DevOps, though, it’s still been kind of a high-level term. And we’ve talked a little bit about monitoring and continuous delivery. When you say the term DevOps, can you kind of break that down into pieces a little bit more?

Yeah, I’ll try. Actually, the schema that I would use comes from John Willis, who talked about DevOps as culture, automation, measurement, and sharing. Okay. And culture comes first for a good reason. So the idea with culture is that we want to create a high-trust culture where we have strong and deep collaboration between development and operations. This is why, by the way, I consider it a fallacy to hire a new DevOps team that sits between dev and ops and is supposed to be a bridge. Because instead of creating a tight junction, what you’ve really done is now made two handoffs instead of just one.

Yeah, and they’re alien, and now there’s friction.

Yeah, absolutely. And both sides feel threatened by this new team. So the culture is one of enablement and mutual support. So where previously operations was frequently held accountable for code they didn’t write and the availability and performance of said code, now we would close that gap. We would close that feedback loop and say operations is going to enable development to move things into production as rapidly as possible and also give development the tools to see the effects of what they do. Right. And now the responsibility is on development to use that ability wisely. Right.
So instead of focusing on process and automation for self-protection, you’re focusing on automation to enable your partners and collaborators in doing this. So that’s kind of the culture that we want to create. All of the tools, all of the monitoring is in support of that culture.

And, you know, what I’m thinking about when you’re talking about the information is not looking at those metrics as a way to protect yourself, or, we should say, mitigate risk, or find blame, I should say. That’s what I was trying to get at. I think about if I’m running an app and I launch it on my local machine, I’m watching my CPU and my RAM. I just want to know how it’s doing. And that’s a normal thing that I do when I’m writing code locally, but we should also maybe look at our entire systems with that same level of detachment, that it’s not something we’re looking at to protect ourselves from blame or faults, but just: how is this system working?

Yeah, in fact, I would go even farther and say, when I’m putting code into production, I’m not just interested in how it’s doing on RAM and CPU. I’m also looking at the effect on the users. What’s the response time distribution? What have I done as far as average latency and the 99th percentile latency? What have I done in terms of conversion rates and revenue coming into my business? You know, these are things that developers can care about and want to improve. Right. There’s a much older idea that says, you know, developers don’t care about the business. They don’t understand money. You know, as long as they get to play with their code, they don’t care what happens to the business. I found that to be a foolish stereotype. And so now you’ll have developers who will optimize a chunk of code in order to improve the response time and get the conversion rate up. Right. So we need to think more.
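As an illustration of the "limiting exposure in space" idea from the interview (feature flags that let a few people in at a time, plus the ability to turn a feature off at once), here is a minimal Python sketch. It is not taken from the talk. The names `bucket`, `FeatureFlag`, `percent`, `kill` and the flag name `new-checkout` are all hypothetical, and hashing the user id into 100 buckets is just one common way to get a stable percentage rollout.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag (hypothetical helper)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


class FeatureFlag:
    """Hypothetical percentage-rollout flag with a kill switch."""

    def __init__(self, name: str, percent: int = 0) -> None:
        self.name = name
        self.percent = percent  # share of users exposed, 0 to 100
        self.enabled = True     # set to False to turn the feature off for everyone

    def is_on(self, user_id: str) -> bool:
        # Limit exposure "in space": only users whose bucket falls below
        # the rollout percentage see the new code path.
        if not self.enabled:
            return False
        return bucket(user_id, self.name) < self.percent

    def kill(self) -> None:
        # Limit exposure "in time": switch the feature off without a new deploy.
        self.enabled = False


if __name__ == "__main__":
    flag = FeatureFlag("new-checkout", percent=5)
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(flag.is_on(u) for u in users)
    print(f"{exposed} of {len(users)} users see the new code")
    flag.kill()
    print(sum(flag.is_on(u) for u in users), "users after the kill switch")
```

Because a user's bucket depends only on the flag name and user id, raising `percent` from 5 to 20 keeps the original users in the exposed group and adds new ones, which matches the "grow it out" step the interviewer describes. A production system would typically keep the flag state in shared configuration rather than in process memory.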