I have been working professionally as a Software Engineer for the past 10 years. In that time, I've learned a huge amount, gained a bit of confidence, and largely ignored the social nature of our field. I haven't given back to the community and now feel like it's a good time to change that. I've been very lucky in my career thus far and want to share the broad lessons that I've learned along the way.
This is part four of a series reflecting on my career:
- Part one: How to solve problems
- Part two: Study other people's code
- Part three: Burnout is self-inflicted
- Part four: Fear is the mind-killer
- Part five: The value of a test
Fear is the mind-killer
When I started at IMVU, it had a large, complex, and tech-debt-ridden codebase. I say that not to shame the company or its codebase, but to acknowledge that the company had been operating at a loss for five years prior to my joining in 2010. Unwieldy codebases happen and, as far as I can tell, are perfectly natural. That said, as a newcomer, I found the code difficult to reason about and difficult to extend. Comprehending the code and its original intent was hard, and making deliberate changes often required many false starts. Initially the code was daunting; I was fearful and avoided making changes to deep, core areas of the codebase.
Looking around, I saw that the most effective engineers were able to do amazing things, changing large swaths of the core codebase without causing breakage. I asked Chad how he did it, and he would quote Dune, saying "fear is the mind-killer." These engineers wouldn't avoid making changes to the scary, heavily used areas. Instead, they would tackle them head on, taking the time to question the assumptions of the APIs and proceed in a direction they felt was right.
There was a pattern in their work: they were not afraid to make mistakes, were not afraid to break things, and were quick to challenge and verify their assumptions. Need to know if something is used? Delete it. Need to know how an untested component works? Write a test to reverse-engineer its behavior.
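That reverse-engineering move is sometimes called a characterization test: run the untouched code, observe what it actually does, and lock that behavior in with assertions before changing anything. A minimal sketch, using an invented stand-in function (`legacy_format_price` is hypothetical, not from IMVU's codebase):

```python
# Hypothetical sketch of a characterization test: assert whatever the
# legacy code does *today*, so a later refactor can't silently change it.

def legacy_format_price(cents):
    # Stand-in for old, undocumented code we're afraid to touch.
    dollars = cents // 100
    remainder = cents % 100
    return "$%d.%02d" % (dollars, remainder)

# These assertions record observed behavior, not a spec.
assert legacy_format_price(500) == "$5.00"
assert legacy_format_price(305) == "$3.05"
# A surprise discovered by probing: Python's floor division makes
# negative inputs come out strangely. Pin it down anyway.
assert legacy_format_price(-1) == "$-1.99"
```

The point isn't that the pinned behavior is correct; it's that any future change to it becomes visible and deliberate rather than accidental.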
A culture of fearlessness
This quick and fearless approach was not unique to Chad and the other lead engineers. It was surrounded by a culture that emphasized and supported reliability and clear communication. The organization reinforced this through its processes and patterns, and the engineering organization favored architectures that helped foster this type of effective engineer.
IMVU had a mailing list called "change@" to which developers would send a message after making any notable change to how the product or developer experience worked. It was at the discretion of the person making the change whether or not to email the list. Aside from the details of what the change was and why they believed it was a good thing, there was a standard postscript included in every single email, which answered a few questions, notably:
- I wrote/DID NOT write automated tests
- I verified/DID NOT verify my change locally (in a dev environment)
- I verified/DID NOT verify my change in production
- I showed/DID NOT show my change to someone/anyone
These emails served both as a stream of changes for our support teams as well as a checklist of practices that helped reinforce overall stability. There was no shame in making a change without writing tests, but it certainly helped to publicly acknowledge that tests weren't written. When a change couldn't easily be verified in production, it helped to call out to others to be on the lookout for potential problems.
We discussed failures openly and without hesitation. Whenever there was an outage, system failure, or other surprise, we would research what happened and identify all of the contributing factors of the failure. The terms we used for these "post-mortems" were "root cause analysis" and "five whys" (terms Toyota used to deal with production failures), but the process was never a search for blame. We would create a timeline of events, looking for problems that could have been identified earlier or prevented, whether by automated means or manual processes. As a follow-up to these meetings, we would create two sets of tasks:
- Ones that must be done, which would have helped identify, diagnose, or reduce the impact of the outage
- Ones that won't be done, which helped us acknowledge the limits of what we were capable of doing at the time
Full disclosure: there was always a growing backlog of these remediation items, even those marked as must be done. But we always strove to make progress against them.
My first big failure
In my first year at IMVU, I was on a team which handled the payment backend and anti-fraud systems. I made a small change to our payments backend which was intended to add experimental behavior for staff members and normal behavior for our customers. But I screwed up. Here are the main events:
- I ended up getting a boolean expression backwards and shipped the experimental bits to all customers and left the old behavior for our staff.
- I didn't send a change email notifying everyone of my change.
- I verified that payments worked for myself (which they did, because I was a staff member).
- I didn't monitor the effect on our customers.
- I didn't verify that the experimental code path was behaving properly.
To make matters worse, the bug occurred after we accepted payment but before we delivered the purchased goods. We took money without giving anything in return. I broke things pretty badly.
It took an hour or so to identify and fix the regression, and much more time to identify and deliver the purchased goods to the original purchasers. When I realized what I had done, my heart sank into the pit of my stomach. I did some mental math, comparing the lost revenue to my salary. My mind was racing, and I was honestly fearful for my job.
Thankfully, once we corrected the problem I wasn't blamed (aside from a stern, "don't do that") for the failure, but instead I was strangely praised as someone who had managed to overcome our "immune system" of automated alerts, graphs, and charts.
Failure is an opportunity
My manager took me aside to talk about what happened. We talked through the problems, and he asked me what I could have done better. He told me straight up that I didn't have anything to worry about: it was fine for me to feel pretty low, but I also had an opportunity to learn from my mistakes and be a champion for better practices of communication, testing, and monitoring. Going through the post-mortem led to a number of changes: at the high level, how we communicate about changes to sensitive areas; operationally, how we monitor and mark successful transactions; and at the low level, naming our functions to make them less prone to double-negative booleans, and identifying code that is only executed by customers and not staff.
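The double-negative lesson is easy to see in code. A minimal sketch (the function and flag names here are invented for illustration, not IMVU's actual payments code):

```python
# Hypothetical sketch: how a boolean's name affects how easy it is to
# flip a condition and ship an experiment to the wrong audience.

def run_experiment(user):
    return ("experimental", user)

def run_standard_flow(user):
    return ("standard", user)

# Risky: a negated flag name invites double negatives.
def charge_customer(user, is_not_staff):
    if not is_not_staff:             # "not is_not_staff" -- easy to get backwards
        return run_experiment(user)
    return run_standard_flow(user)

# Safer: the positive name reads the same way the intent is stated.
def charge_customer_v2(user, is_staff):
    if is_staff:
        return run_experiment(user)  # staff get the experimental path
    return run_standard_flow(user)
```

Both versions behave the same; the difference is that the second one can be checked against the sentence "staff get the experiment" at a glance, with no mental negation.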
I had my tail between my legs for a few weeks, but despite my guilt I had also made things better for everyone. I forced the organization to improve its defenses and became a better engineer, both in how I communicated with others and in how I wrote and verified code.