Thursday, October 20, 2016

Releasing with balls of steel

Product Owner: Henri, can you check your email? I forwarded you a thread where people are discussing the address issue.
Me: Sure.

I read the emails and see that the one known address-related issue is actually causing bigger problems than expected. The good news is that we have already fixed the problem; the fix just hasn't been released yet.

Me: So, sounds like we could still do the release today, after all.

1.5 hours ago, in our daily standup, I had concluded the obvious. If it were a normal day, we could do a production release today, but today is not a normal day. Today is the single most important day for the whole corporation because of what happens in the evening. So no releases today. And besides, it's Friday, and Fridays are never the best days to release anything.

But now the situation has changed. If we don't release, there will be certain customers whose purchases will cause a lot of extra work for customer support because the address data won't be saved correctly.

Me: Hey, let's go to lunch now. When we come back, we'll start preparing the release. If it looks like it's not going to work out, we won't do the release. But if everything goes fine, this release will be pretty valuable. I want to emphasise one thing, though: we have to be extra disciplined today.

Product Owner: While you were at lunch, I tested the address fix and it works. But it seems that we have another address-related problem. The street address validation is far from optimal.
Me: Well, that should be a very small thing to fix. I wonder if it actually explains yesterday's complaints about some people not being able to complete their purchases?
Product Owner: Could be.
Me: It seems to me that we should definitely include this fix in the release as well. It's quite bad if customers want to buy but aren't able to.
Product Owner: I agree.
Me: Ok, let's split the work here. Developer, can you fix this second address issue?
Developer: Yes, I can do that.
Me: And I'll continue looking at this login change.
Product Owner: I'll continue testing the other new features.

Me: I have found a possible solution for the login issue. It would fix the problem but then cause another problem, which maybe isn't that big a thing. Hmm... Or, actually, it ruins the whole idea of what this login feature is all about because... Ok, I have to think a bit more.

Developer: I have the address fix done.
Me: Good, let's review it right away.

Me: This looks good. Could you next check what is wrong with this one end-to-end test? I first thought it was a random failure in the APIs, but it has failed quite a few times recently. I'd say it's important to check, so that we get the tests green before the release.
Developer: Yes, I can look at that.

Me: This new login feature is a bit more complicated than expected. I suggest that we don't include it in this release, since the risks are too high and the possible benefit would be close to zero, at least as far as the evening is concerned.
Product Owner: Sounds good to me.

Developer: I found the reason for the test failure. It seems that this particular XHR call has quite a short timeout. If the API responds a bit slowly, as it seems to do quite often, the user sees the error message.
Me: Makes sense. Well spotted! And I have now reverted the login changes. We can continue finishing the feature on Monday. Let's review your change and then start making the release together.
Developer: Ok.
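
For readers who want to see the kind of failure the Developer found, here is a minimal sketch of it in TypeScript. It is not our actual code: `withTimeout`, `fakeApi`, the error name and all the millisecond values are made up for illustration. The idea is simply that a response arriving after the client's timeout is treated as an error, even though the API would have answered correctly a moment later.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const error = new Error(`request timed out after ${timeoutMs} ms`);
      error.name = "TimeoutError";
      reject(error);
    }, timeoutMs);

    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// Stand-in for the real API: resolves with `payload` after `delayMs`.
function fakeApi(payload: string, delayMs: number): Promise<string> {
  return new Promise<string>((resolve) => {
    setTimeout(() => resolve(payload), delayMs);
  });
}

// What the user ends up seeing for a given API delay and client timeout.
async function submit(delayMs: number, timeoutMs: number): Promise<string> {
  try {
    return await withTimeout(fakeApi("saved", delayMs), timeoutMs);
  } catch (error) {
    return "error message shown to the user";
  }
}
```

With these made-up numbers, `submit(10, 200)` resolves to `"saved"`, while `submit(200, 20)` resolves to the error message, which is the same pattern as a test that fails only when the API happens to be slow.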

Our build process in CI is annoyingly slow. I add a ticket to our board saying that we have to do something about it in the near future. On the other hand, this is a good time to think about whether there's something special that has to be taken into account in this release. Besides the special day, there isn't. It's actually quite the opposite: it's important to do the release exactly the same way as any other release.

Me: The latest version is now in the dev environment. You can test the address fix and we will see whether the end-to-end tests pass.
Product Owner: Ok.

Product Owner: Yes, the address works well now. As far as I'm concerned, you can do the staging release.
Me: We also checked the tests and they are good too. I will start the staging release.

Me: The staging release is now done. I will run the load tests next.
Developer: I can check the staging end-to-end tests as they finish.
Product Owner: And I will once more check the new features in the staging environment.

Developer: The tests are ok.
Me: And so are the load tests, nothing special there.

Product Owner: Ok, I have moved all the cards to the "Ready for prod" column. Will you do the production release next?
Me: Yes. And I will inform the stakeholders.

15.00 (chat)
Me: Hi @DBDev, @EveningBusinessOwner and @EveningDev. We just did a staging release and the prod release is ongoing. This release fixes the address validation and address saving problems.
DBDev: Ok
EveningDev: Friday 15.00. You have balls of steel.
EveningDevelopmentManager1: Did you do it "jacket on and compile"?
APIDev: same thoughts...
Me: :)

Me: The production release is done now. I clicked through the main paths and everything is ok so far.

15.05 (chat)
Me: Release done.
EveningServiceManager: Today is a "pretty" important day for us!
Me: And now we are a bit more ready for that
EveningBusinessOwner: Tested?
Me: In many ways. Unit tests, code review, end-to-end tests (in dev and staging environments), Product Owner's tests (in dev and staging environments), load tests. We also went through the new features critically and left one risky feature out of this release.

15.09 (chat)
EveningBusinessOwner: Redirect isn't working
Me: What redirect? I just tested your login and it worked fine.
EveningServiceManager: How about registration and purchasing?

Feels a bit cold. We don't know what he meant by that redirect and we are just waiting for more information. I go through what we have released and can't think of anything that could have broken their part of the system. Calm down, I say to myself. We have done this as well as we can, and so many times before. If there is an issue, it must be something that was already there before.

15.13 (chat)
EveningBusinessOwner: It works in buying. Not in registration but this is actually on our side, we have never implemented that.
Me: Ok
EveningDevelopmentManager2: I'm a bit worried about this release at this point, when there have been so many other things going on during the week.
Me: I understand that, but we did this release precisely to support you.
EveningDevelopmentManager2: I think I'm gonna have a relaxing beer. Let's hope that everything works. We expect plenty of orders today and tomorrow.

15.22 (chat)
APIDev: Registration has already started, it seems.

15.30 (chat)
APIDev: There seem to be quite a lot of errors in registration
Me: What kind of errors?

We figure out that the errors occur because our registration flow is not technically optimal. We could avoid them by making an additional API call before posting the registration data. For the user this has no effect: in both implementations she would see exactly the same error message.

Little by little the situation calms down. The big night has started and we follow the production charts. We can see a nice spike in the usage data but a boring, flat response time graph.

A couple of hours later we wish everyone good night. Everything has gone really well, at least technically. At this stage we don't have any business numbers available.


When is a good time to release new features and fixes? If you look at the question from one side only, the answer is pretty simple: whenever the release makes the system better. On that Friday there were two identified issues with clear solutions, and from that point of view we definitely wanted to release.

But everything has (at least) two sides. The other side of releasing is risk. What if the intended changes we have made cause some unintended ones?

In traditional software development there are two ways to approach this dilemma. One is to admit that the risks are too high and not release. The other is to quickly hack solutions, deploy them, and then probably quickly hack a fix for the problem caused by the initial fix. What both have in common is fear.

However, software development in the 2010s is not fear-driven. It is based on an agile and extremely disciplined process. It is based on code written with great quality, on automated tests with good coverage and quality, on automated build and release processes, and on highly professional people working as a real team.

"Software development in the 2010s is not fear-driven."

I am very proud of how our team was able to do the release on that Friday, and I want to emphasise that (this may disappoint some of you) it wasn't about cowboys doing heroic actions. It was about people following a process that has been tuned along the way and is still being improved all the time. It also took a certain amount of humility to admit that a feature that does not work well enough has to be left out of the release.

What is also important to understand is that this wasn't only about what we did on that day. 99% of it was what we had done before: in previous commits, in previous code reviews, in previous releases, and so on. In a situation like that you cannot fix your process. But if you have done your work well earlier, you are in a great position to react to the changed situation. Which I guess is what Agile is all about.
