Operational burden is a choice

Last updated November 5th, 2024

Time for devlog #5! Last week's was here.

This one isn't really a devlog. It's my thoughts about how software is made, and how I think that should shift.

Types of software

I don't like being oncall.

The Internet doesn't have opening hours. It's a distributed system that unfortunately comes with the expectation of being available 24/7.

As a result, software companies have broadly shifted from selling standalone applications to selling online applications.

I think we should build peer to peer applications.

A standalone application runs on a computer that is owned and operated by the person using the software. It uses local computing power and accesses local data.

An online application is different: a portion (typically just the UI) runs on the computer that is owned and operated by the person using the software. The UI communicates with the "backend" storage and compute, which runs on hardware that is owned and operated by the business producing the software—or even on hardware that is leased from cloud providers.

A peer to peer application runs on a computer that is owned and operated by the person using the software. It uses local computing power and accesses local data. It may act as a server, hosting clients to work together on local documents; and it may act as a client, allowing users to work together on remote documents—all without a backend in the middle.

The perils of always online applications

Online applications let businesses observe & control the data the app consumes and produces.

This is not always a bad thing. Some observation & control enables features that help people.

Search and recommendation features often require a decent amount of data storage and processing that is typically performed within the walls of the business.

Recommendations may also require collecting, mixing, and processing data sourced from many different users to infer general user behaviors. This sort of thing is harder to do well with standalone applications.

But online applications come with costs to the business & consequences to users:

Data breach? Users have to reckon with their data suddenly becoming public
Business shuts down? Users often get left in the lurch
Service outage? Users have to deal with a broken piece of software

GitHub (centralized) ≠ git (decentralized)

Consider GitHub, a business built on the shoulders of a free and open source distributed system: git.

You don't need GitHub to use git. Every repository (by default) contains the full history of changes and may pull and push additional changes to any other repository. A git repo is just a bunch of files in a .git/ directory, all inert while not in use. No backend is needed: each repo is self-contained and may freely interoperate with peers as long as there's a bidirectional channel between them.

But there are good reasons to use GitHub: issues, pull requests, and actions make it easy to build software with others.

[Image: a screenshot of GitHub when it goes down, featuring an angry-looking unicorn]

But easy features can cause lazy thinking.

Nearly every company I've worked for in the past decade had GitHub in the critical path to production. If GitHub went down (which it does, often), there was an incident—people needed to be oncall.

This is an architectural choice. And this kind of choice leads to a 24/7 oncall support rotation—an operational burden.

If, instead, git and a few shell scripts were used in the critical path for deploying to production, these companies would be mostly immune to these sorts of outages, and the overall need for oncall would be reduced.
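As a rough illustration of what "git and a few shell scripts" might boil down to, here is a minimal TypeScript sketch that builds the git commands for deploying by pushing one commit straight to a repository on the production host. The remote URL, the branch name, and the `deployCommands` helper are all hypothetical, and the sketch assumes the production host has its own hook that checks out whatever lands on that branch.

```typescript
// Hypothetical deploy target; none of these values come from a real setup.
interface DeployTarget {
  remoteUrl: string; // e.g. an SSH URL for a bare repo on the production host
  branch: string;    // the branch the production host serves from
}

// Build the git invocations for a deploy that never touches GitHub:
// push one exact commit to the production remote, then read the ref back.
function deployCommands(commit: string, target: DeployTarget): string[][] {
  const ref = `refs/heads/${target.branch}`;
  return [
    ["git", "push", target.remoteUrl, `${commit}:${ref}`],
    ["git", "ls-remote", target.remoteUrl, ref],
  ];
}

const plan = deployCommands("HEAD", {
  remoteUrl: "deploy@prod.example.com:/srv/app.git",
  branch: "release",
});
for (const argv of plan) {
  console.log(argv.join(" "));
}
```

Each entry is an argument list that a script could hand to a process runner. The only network dependency is the connection between the machine doing the deploy and the production host.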

GitHub could still be used for the typical workflow; it just wouldn't be in the critical path.

Ironically, GitHub itself has also chosen to have a high operational burden. (I've never worked there, but can make observations about their architecture.)

Many of the business objects that GitHub offers (issues, pull requests, actions, etc.) could themselves be stored as primitives inside of git's distributed object model (as blobs, trees, commits, branches, tags, etc.).

If this choice were to be made, then GitHub would not need to run as many online services to manipulate these remote business objects. They would be stored the same way as ordinary source files tracked within git. And GitHub could still build UI to present them in the same pleasant way.
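To make the idea concrete, here is a minimal sketch of how an issue could become an ordinary git blob. The `Issue` shape and the JSON serialization are assumptions made up for illustration (this is not GitHub's schema or storage format). The only part taken from git itself is how a blob is named: the SHA-1 of the header `blob <byte length>\0` followed by the content, in a repository using the default SHA-1 object format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for an issue; not GitHub's actual data model.
interface Issue {
  title: string;
  body: string;
  state: "open" | "closed";
}

// Serialize with a fixed key order so the same issue always yields the same bytes.
function serializeIssue(issue: Issue): string {
  const ordered = { title: issue.title, body: issue.body, state: issue.state };
  return JSON.stringify(ordered, null, 2) + "\n";
}

// Compute the object ID git would assign to a blob with this content.
function gitBlobId(content: string): string {
  const bytes = new TextEncoder().encode(content);
  return createHash("sha1")
    .update(`blob ${bytes.length}\0`)
    .update(bytes)
    .digest("hex");
}

const issue: Issue = {
  title: "Crash on startup",
  body: "The app exits immediately after launch.",
  state: "open",
};
console.log(gitBlobId(serializeIssue(issue)));
```

Because the ID depends only on the bytes, any clone holding this blob can verify and exchange it like any other tracked file, with no separate issue database involved.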

There still would need to be online services, but there'd be fewer of them, and they'd be more limited in scope.

And fewer online services means less operational burden.

They probably don't do this because it would give away a piece of their value proposition: managing this data for you.

In a way, they're exchanging a higher operational burden for your data.

What's the alternative?

I'm not suggesting we go offline and stop building online applications.

Working on the same thing with people who aren't in the same room is table stakes these days.

Instead, we should start building peer-to-peer applications.

Running and maintaining a 24/7 centralized server to handle data and relay communication between parties is a drag. With a peer-to-peer application, each client can act as a server managing its own data.

This can even be done in a web application today.

WebRTC allows web pages to establish peer-to-peer connections without¹ an intermediate server handling the data.

Note¹: "without" an intermediate server is not entirely true. In order to establish a WebRTC connection between two parties, there are a few things to consider:

The inviting peer must be able to send an "offer" to the invited peer, and the invited peer must be able to send an "answer" back:
- Bad news: this requires an existing, trusted communications channel between both peers
- Good news: two people who want to work on the same thing can probably already send each other messages over some trusted channel, so we can just ask them to do that
If the peers are behind a NAT, NAT traversal must be performed, which means:
- Using STUN for each peer to identify their public IP address and the kind of NAT they are behind
- Using TURN as a relay to bypass a symmetric NAT that cannot be traversed
- Good news: running a STUN server is fairly easy and low maintenance
- Bad news: while running a TURN service is also fairly easy, it may be costly with respect to bandwidth
- Good news: you don't always need a TURN service
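The STUN and TURN trade-off above shows up directly in how a connection is configured. The sketch below builds the `iceServers` list that a browser's `RTCPeerConnection` constructor accepts, always including STUN and adding TURN only when a relay is configured. The hostnames, credentials, and the `buildIceConfig` helper are placeholders invented for illustration, and the interfaces are a locally declared subset of the browser's dictionaries so the sketch also runs outside a browser.

```typescript
// Local subset of the browser's RTCIceServer and RTCConfiguration dictionaries.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}
interface IceConfig {
  iceServers: IceServer[];
}

// Hypothetical TURN settings; only needed when direct traversal fails.
interface TurnSettings {
  url: string;
  username: string;
  credential: string;
}

// Always include STUN (cheap to run); add the TURN relay only if one is configured.
function buildIceConfig(stunUrl: string, turn?: TurnSettings): IceConfig {
  const iceServers: IceServer[] = [{ urls: stunUrl }];
  if (turn) {
    iceServers.push({
      urls: turn.url,
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers };
}

// Placeholder hostnames; substitute servers you actually operate.
const stunOnly = buildIceConfig("stun:stun.example.org:3478");
const withRelay = buildIceConfig("stun:stun.example.org:3478", {
  url: "turn:turn.example.org:3478",
  username: "guest",
  credential: "secret",
});
console.log(JSON.stringify(stunOnly));
console.log(JSON.stringify(withRelay));

// In a browser, the result is passed straight to the constructor:
//   const pc = new RTCPeerConnection(withRelay);
```

Starting with the STUN-only configuration and adding a relay later keeps the bandwidth cost of TURN limited to the peers that actually need it.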

Multiplayer peer-to-peer collaboration

It's hard to add "multiplayer" support to documents.

When I look around for best practices to implement this, I often see folks talk about CRDTs as a solution to manage the complexity of multiple parties manipulating the same document without conflicts.

CRDTs are fascinating, but to me they feel like overengineered "solutions" to fundamentally social problems.

Every time I've worked with people on the same document, there's always the same power structure: one person "hosts" the session, and everyone else is a "guest" acting politely while changing the document.

As the host, I can make authoritative calls about the document's content and structure
As a guest, I can make changes, but I'm not going to make a fuss if those changes are backed out or altered

It can get a bit messy when multiple people are trying to change the same thing (e.g. everyone adding items to the same list).

But that messiness is social: we end up talking with each other to come up with an ad-hoc strategy to work together (e.g. split the list so each person has their own, then merge afterward).

And once the session is over and everyone has left, the host typically tidies up the loose ends.

Build "multiplayer" support that embraces this power structure

Instead of reaching for CRDTs and distributed state synchronization (where all parties are "equal" in ownership and intent), let's reach for the same tools used when a client makes requests of data on a server.

There is then no difference between a client/server structure and a peer-to-peer multiplayer structure. The client is a guest and the server is the host.

Document coordination could be as simple as:

Event	Response
A guest joins	The host tells them the state of the world.
The host makes a change	The host tells all guests what has changed.
A guest makes a change	The guest asks the host to accept the change.
	Guests may optimistically update their local view (but must be prepared to roll back).
	The host may automatically or manually accept or reject the change, depending on the host's state.
The host accepts a guest's change	The host makes the change and tells all guests what has changed.
The host rejects a guest's change	The host rejects the change, resulting in no state changes.
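The table above can be sketched as a small in-memory model. This is not the implementation mentioned later in the post: every name here (`Doc`, `Change`, `Host`, `Guest`, `stage`, `settle`) is hypothetical, "telling" a guest something is just a method call rather than a network message, and the accept/reject policy is an example chosen for the demo.

```typescript
type Doc = Map<string, string>;
interface Change {
  key: string;
  value: string;
}

class Guest {
  private view: Doc = new Map();   // last state confirmed by the host
  private pending: Change[] = [];  // optimistic changes awaiting a verdict

  // What the guest sees locally: confirmed state plus optimistic edits.
  get localView(): Doc {
    const doc = new Map(this.view);
    for (const c of this.pending) doc.set(c.key, c.value);
    return doc;
  }
  receiveState(doc: Doc): void { this.view = new Map(doc); }
  receiveChange(c: Change): void { this.view.set(c.key, c.value); }

  // Optimistically show a change before the host has answered.
  stage(c: Change): void { this.pending.push(c); }
  // Drop the optimistic copy once the host has answered. If the change was
  // accepted, the host's broadcast already updated `view`; if it was
  // rejected, dropping it is the rollback.
  settle(c: Change): void { this.pending = this.pending.filter((p) => p !== c); }
}

class Host {
  private doc: Doc = new Map();
  private guests: Guest[] = [];
  private policy: (c: Change, doc: Doc) => boolean;

  constructor(policy: (c: Change, doc: Doc) => boolean) { this.policy = policy; }

  // A guest joins: tell them the state of the world.
  join(guest: Guest): void {
    this.guests.push(guest);
    guest.receiveState(this.doc);
  }
  // The host makes a change: apply it, then tell all guests.
  change(c: Change): void {
    this.doc.set(c.key, c.value);
    for (const g of this.guests) g.receiveChange(c);
  }
  // A guest asks for a change: accept and broadcast, or reject and do nothing.
  request(c: Change): boolean {
    if (!this.policy(c, this.doc)) return false;
    this.change(c);
    return true;
  }
}

// Example policy: guests may add new keys but not overwrite existing ones.
const host = new Host((c, doc) => !doc.has(c.key));
const alice = new Guest();
const bob = new Guest();
host.change({ key: "title", value: "Groceries" });
host.join(alice);
host.join(bob);

const addItem: Change = { key: "item1", value: "milk" };
alice.stage(addItem);
const addAccepted = host.request(addItem);
alice.settle(addItem);

const retitle: Change = { key: "title", value: "Bob's list" };
bob.stage(retitle);
const optimisticTitle = bob.localView.get("title");
const retitleAccepted = host.request(retitle);
bob.settle(retitle);

console.log(addAccepted, retitleAccepted, optimisticTitle, bob.localView.get("title"));
```

Keeping the guest's confirmed state separate from its pending edits is one way to make rollback trivial: a rejection never has to undo anything in the confirmed copy.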

Does this actually work?

Probably! I'm not sure, but I'm going to try.

Hopefully, I'll have an implementation of the above soon, which will allow multiple people to operate as guests on a document hosted by one of the users, with no high-maintenance servers involved.