Best of GOTO Chicago 2018

Speaking at GOTO Chicago's Open Source panel

Wow, what a week! I'll spare you any "May the fourth be with you" puns (unless that in itself counts) and dive right into the best of GOTO Chicago 2018.

This past week I had the opportunity to attend my first GOTO conference, and overall I would say it was a good experience. I honestly was not blown away by the sponsor booths, and some of the talks were a little rough around the edges... but what conference doesn't have its fair share of flops? We're not here to talk about those, though; I want to highlight the talks and experiences that really stood out.

Let's start with...

Testing legacy code

My first exposure to GOTO was an all-day workshop hosted by Michael C. Feathers, the author of Working Effectively with Legacy Code. While the workshop consisted mostly of techniques for getting legacy code into a testable state, the content presented would take up an entire blog post in itself (which I will assuredly write at some point in the near future). So I am just going to focus on two key elements that really resonated with me.

public versus published
We've all heard of access modifiers (private, public, etc.), but what about published? The easiest way to describe it is to give an example, comparing it to public.

Public should be familiar territory. If something is marked as public, it can be accessed by other classes, subclasses, and even assemblies. Within an organization, it should be relatively easy to change public signatures. Sure, it might be a lot of work, depending on how many projects reference the public interface, but it's easy to see what needs to change, and you can have a high level of confidence that you won't break anything. You may even have the ability to update all of the references yourself (which I would encourage you to do).

Published, on the other hand, you don't have any control over. A prime example would be an API that other companies leverage. You can't change the interface willy-nilly because you don't have the ability to update the references. Published is an abstract concept. You don't really know if something is marked as published, because there isn't an access modifier for it. It just isn't supported in the language.

While this makes complete sense, I was just happy to finally put some context around the concept. Furthermore, I've been officially converted to the camp of: If you're changing a public interface, do your due diligence and update the references yourself. No more of this obsoleting a method that will never be updated.
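Since there's no published keyword, the distinction lives entirely in convention. Here's a rough Java sketch of how I think about it now; the class and interface names are mine, made up for illustration, not from the workshop:

    // InvoiceCalculator.java
    // Internal code: public, but every caller lives in repositories we control.
    // If this signature changes, we can go update all of the references ourselves.
    public class InvoiceCalculator {
        public double totalWithTax(double subtotal, double taxRate) {
            return subtotal * (1 + taxRate);
        }
    }

    // BillingApi.java
    // "Published" code: part of an API that other companies call. We can't update
    // their references, so the old signature has to stay, even though nothing in
    // the language marks it as special.
    public interface BillingApi {
        // Old published method: kept (and delegating) so external callers don't break.
        @Deprecated
        default double total(double subtotal) {
            return total(subtotal, 0.0);
        }

        // New method added alongside the old one rather than replacing it.
        double total(double subtotal, double taxRate);
    }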

Best is the enemy of better

In reality, after some google-fu, the official quote is:

Perfect is the enemy of good - Voltaire

Regardless, to me this was a simple yet powerful quote that I had not heard before. It stems from the idea that so many times in our careers we spin our wheels indefinitely trying to seek out the perfect solution. Maybe we never implement the perfect solution, because it's just not possible right now. However, it may be possible to implement a good solution. We just fail to realize it sometimes.

A big part of the workshop was refactoring code to make it testable. To do so, we made a lot of new classes that the old classes just called into. Our old code essentially just became wrappers around the new code. Obviously, this is not a perfect solution, but it's a step in the right direction to get the code into a clean and testable state.
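Roughly what that looked like in practice, as a hypothetical Java sketch (the class names are mine, not from the workshop material):

    // ShippingCostCalculator.java
    // The new class: plain logic with no awkward dependencies, so it can be
    // unit tested directly.
    public class ShippingCostCalculator {
        public double costFor(double weightKg, double ratePerKg) {
            return weightKg * ratePerKg;
        }
    }

    // LegacyOrderProcessor.java
    // The old class keeps its original signature so existing callers are untouched,
    // but internally it has become a thin wrapper around the new code.
    public class LegacyOrderProcessor {
        private final ShippingCostCalculator calculator = new ShippingCostCalculator();

        public double shippingCost(double weightKg) {
            // In the original code this would have mixed the calculation with
            // other concerns; now it just delegates to the testable class.
            return calculator.costFor(weightKg, 4.25);
        }
    }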

After wrapping up the testing workshop, it was time to head on over to the first keynote of the conference.

Drifting into failure

Presentation: https://www.youtube.com/watch?v=mFQRn_m2mP4

The first keynote of the conference was given by Adrian Cockcroft, the VP of Cloud Architecture Strategy for Amazon. It was a little surprising, as he wasn't on the original agenda. Apparently the original speaker came down with a cold, so he filled in. Honestly, I'm glad he did. It was a very enlightening talk. Adrian focused on a central theme, drifting into failure.

Drifting into failure stems from the idea that people in the workplace may not report issues to upper management for fear of looking inadequate or, even worse, losing their job. These issues, albeit small at the time, start to pile up, eventually leading to a catastrophic event.

One example of this was the fact that airlines with more reported incidents actually had fewer fatalities. At first, this seems pretty counterintuitive. How can you have more incidents but fewer catastrophic events? Does this mean that an airline with fewer reported incidents actually had more fatalities? Yes.

The takeaway here is that incidents are going to occur no matter what. It's just a matter of whether they're actually reported. Just because an incident isn't reported doesn't mean it didn't happen. We can't make our processes better if we are unaware of the problems that are occurring. We need to embrace a blameless postmortem culture so that we become aware of these incidents and do not eventually drift into failure. So how can we prevent this from happening? Create non-events.

Adrian spoke of dynamic non-events. It wasn't clear to me what the actual definition was, so I looked up the meaning in its entirety:

Reliability is a dynamic non-event. It is dynamic because the processes remain within acceptable limits due to moment-to-moment adjustments and compensations by the human operators. It is a non-event because safe outcomes claim little or no attention. - Karl Weick

Creating a non-event essentially means reporting abnormal behavior, or at least being aware of it. It doesn't have to be catastrophic, and in all honesty it shouldn't be. If a metric is slightly off or just under its minimum, it should be reported and discussed before it balloons into a much larger problem.

The example that he gave comes from a book, Drifting into Failure by Sidney Dekker. A component of an aircraft passed all safety measures, but barely. It was never reported, because everything was technically fine. The employees did everything they were supposed to do as laid out by the system, but in this case, the system was wrong. It led to a disastrous crash, killing everyone on board.
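That framing stuck with me because it's something you can bake into tooling: don't just record pass or fail, surface the near misses too. A small, entirely hypothetical Java sketch of the idea (the names, units, and thresholds are all invented):

    public class WearCheck {
        private static final double MAX_WEAR_MM = 2.0;      // the hard limit
        private static final double REPORT_MARGIN_MM = 0.2; // the "barely passed" zone

        public boolean inspect(double measuredWearMm) {
            boolean passed = measuredWearMm <= MAX_WEAR_MM;
            boolean nearMiss = passed && measuredWearMm >= MAX_WEAR_MM - REPORT_MARGIN_MM;

            if (!passed || nearMiss) {
                // This report is the "non-event": nothing has broken yet, but someone
                // now knows the measurement is drifting toward the limit.
                System.out.printf("Wear check: %.2f mm of %.2f mm allowed (%s)%n",
                        measuredWearMm, MAX_WEAR_MM, passed ? "near miss" : "FAILED");
            }
            return passed;
        }
    }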

Non-events are a time for people to report and learn. No one should be afraid to report bad news, or the possibility of bad news. We should strive to actively create safety. His recommendation to create safety?

Break things on purpose!

GOTO had quite a bit of content on a concept that I wasn't completely aware of, chaos engineering. I knew it existed, but never really dove into the specifics of how it worked or how to practice it. This is another topic that I could easily dedicate a whole post to, so we'll just touch on some key takeaways.

Chaos engineering exists because production hates us. Leveraging DevOps is a step in the right direction, but unfortunately, production is a war zone. Nothing can guarantee that when you push the deploy button, everything is going to work exactly as you expect it to. In all honesty, production doesn't want your code to work. Chaos engineering is really about engineering ourselves out of chaos, not introducing more of it into our systems; that would just be silly.

It was noted that people actually cause the most chaos. One example that was given was a command line argument. If a certain command was passed with the letter g, everything was a-OK. However, if you were to forget that letter g, all hell would break loose. One may think this is just another case of user error, but the speaker presented it as a system error. Why is the line between success and failure so thin?
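The speaker didn't name the tool, so here's a made-up Java sketch of that failure mode, along with the kind of default that would widen the gap between success and disaster:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical reconstruction of the anecdote. In the fragile design, safety
    // hinged on remembering a single character:
    //   boolean safe = flags.contains("-g"); // forget this and the destructive path runs
    // A safer default makes the destructive path the one you have to ask for explicitly.
    public class BatchJob {
        public static void main(String[] args) {
            List<String> flags = Arrays.asList(args);
            boolean destructive = flags.contains("--really-delete");

            if (destructive) {
                System.out.println("Running destructive cleanup...");
            } else {
                System.out.println("Dry run only; pass --really-delete to make changes.");
            }
        }
    }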

So how do we implement chaos engineering? Typically through learning loops. Chaos engineering is all about the things you know you don't know, so you should actively prod those possible failure cases. The implementation steps were broken down as follows (there's a rough sketch of one iteration after the list):

  1. Prod the system
  2. Hypothesize why it happened
  3. Communicate to your team
  4. Run experiments
  5. Analyze the results
  6. Increase the scope
  7. Automate the experiments
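To make the loop a little more concrete, here's a very rough Java sketch of a single iteration. Everything in it is hypothetical: the URL, the threshold, and the fault injection (a real experiment would use proper fault-injection tooling to add latency or kill instances, not a placeholder method):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CheckoutLatencyExperiment {
        private static final String STEADY_STATE_URL = "https://staging.example.com/health";
        private static final int ACCEPTABLE_MILLIS = 500;

        public static void main(String[] args) throws Exception {
            // Steps 1-3: prod the system with a small injected failure, state the
            // hypothesis (checkout stays healthy even when a dependency is slow),
            // and share that hypothesis with the team before running anything.
            injectDependencyLatency("recommendations", 2000);

            // Steps 4-5: run the experiment and analyze whether the steady-state
            // metric stayed within the limit everyone agreed on.
            long elapsed = timeHealthCheck();
            boolean hypothesisHeld = elapsed <= ACCEPTABLE_MILLIS;
            System.out.printf("Health check took %d ms (limit %d ms): %s%n",
                    elapsed, ACCEPTABLE_MILLIS, hypothesisHeld ? "hypothesis held" : "hypothesis failed");

            // Steps 6-7: if the hypothesis keeps holding, widen the blast radius
            // and automate the run.
        }

        private static long timeHealthCheck() throws Exception {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) new URL(STEADY_STATE_URL).openConnection();
            conn.getResponseCode();
            return System.currentTimeMillis() - start;
        }

        private static void injectDependencyLatency(String service, int millis) {
            // Placeholder: stands in for whatever fault-injection tooling you use.
            System.out.printf("Pretending to add %d ms of latency to %s...%n", millis, service);
        }
    }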

In the end, the goal of chaos engineering isn't to break the system just for the sake of breaking it. You would never want to intentionally take down production and impact your customers. And if you already know of a failure case and how it behaves, there really isn't much value in running a chaos experiment for it. Everything is already known!

Old is the NEW new

Presentation: https://www.youtube.com/watch?v=AbgsfeGvg3E

I love Kevlin Henney as a speaker; he always does a fantastic job. I highly recommend giving the presentation itself a watch, but the Cliffs Notes version is essentially that everything in computer science has already been done. Most of the new stuff we see today appeared in literature as early as the 1950s. Software design principles, testing approaches, you name it.

The vast majority of the presentation compared new ideas to old ideas, highlighting the fact that we've been thinking about and solving the same problems for quite some time now.

To me this is very similar to Uncle Bob's post, The Churn, which states that we as developers are so infatuated with learning all of the new and shiny frameworks and libraries that are released that we drift away from the basics. We forget core principles, and never really advance because we're stuck in the churn of learning every new thing that comes out.

I agree with most of it. I couldn't imagine trying to pick up every little new thing that pops up, and frankly I just don't think it's possible.

In the end, it was a worthwhile conference. I may just have to add it to my list of yearly must GO-TO conferences...

Until next time!