Gooey internals: simplifying the graph
Last updated
Over the past few years, I’ve been building a frontend web framework called Gooey. This article is part of a series around how it works:
I haven't really talked much about what I'm actually doing with Gooey. So I'm gonna write about the problem I've been working on this past week.
Gooey uses a big directed graph to represent state dependencies.
I've been working on a change to its core: processing this dependency graph can sometimes be slow, and I want it to be very fast.
The dependency graph
This graph holds fields (individual pieces of state), collections (arrays of state), and calculations (functions that read other state and return a value).
When state changes, nodes in the graph are marked as "dirty" and dirty nodes in the graph are visited in topological order—if a dirty node has a value that has changed, the "dirty" bit propagates downward to nodes that depend on it.
This propagation is what powers Gooey, allowing its calculations to re-evaluate when their dependencies have changed and allowing state to be bound directly to the DOM.
For this to be efficient and correct, the graph needs to be processed in topological order. Dependencies need to be updated before the thing that depends on them can be updated.
So it must maintain a topological ordering of the graph, even when nodes/edges are added/removed. It needs to know that when the graph has changed shape, how to reorder the topological ordering to reflect this new shape.
And traversing the full graph to build this ordering is a linear-in-time operation with respect to the number of nodes & edges within the graph.
Linear is pretty good, but when you have large graphs, even linear operations can be slow.
Say you have a dependency graph with 10k nodes and 30k edges. If one additional node is added, it'd be disappointing if the engine had to visit all 40k elements to rebuild the full topological ordering.
Internally, the engine optimizes for small changes by being able to reuse the previously calculated ordering when building an updated topology. It only reorders the part of the topology that is related to the changes, which can significantly cut down on the number of elements visited.
But unfortunately there's a common operation that makes this optimization unusable: adding items to collections.
What is the topology of a collection?
When you have a collection of items that are rendered to JSX, like this:
The graph that is constructed in the initial state looks something like this:
The arrows represent the data dependencies, pointing in the direction that data flows.
Hopefully the structure makes sense. If you were to change the value of one of the numbers, the corresponding node gets
dirtied, the dirtiness "flows downward" following the arrows, and the corresponding transformed value in the
.mapView(...)
set of nodes gets recalculated.
For added items, imagine a new node being added to the top box, and then flowing downward to the bottom projected box, creating a new node down there.
The "emitter" and "consumer" node are special nodes that know about how to receive and pass forward the array events as these collections are changed.
Maintaining topological ordering is hard
Let's say the "Add number" button is clicked. A new value is pushed onto the end of our numbers
collection.
For the topological ordering to be correct, the new node needs to be inserted before the emitter and the calc
function that draws the last item.
This is unfortunate because inserting an item into the array that holds the topological ordering requires adjusting nearly all items—it shifts everything to the right by one. No optimization can reduce the scope of this insertion.
And this is our bottleneck. I want Gooey to have commonplace operations like adding an item to an array to be fast. This is not fast enough.
To solve this, I'm changing the representation of the graph itself. Today, collections have fine-grained nodes in the
graph, allowing the graph processing engine to propagate dirtiness in a simple (and fine-grained way). So if
something depends on just number[1]
, it will only be calculated if number[1]
changes.
What if instead the graph looked like this?
Instead of having a single node for each item in a collection, each collection is represented as a single node.
This is possible because the topological ordering of nodes in a collection always should be the same: there's never a need for something to be processed between two parts of the same collection.
However, getting to this state is a big change. The "dirty" propagation currently is done just by looking at the shape of the graph. If we reduce all collection item nodes to a single node, the collection needs to know a bit more info about what things are accessing which items and therefore how to propagate the "dirty" bit to the right downstream dependency.
So in our example, the numbers
collection needs to know that the calc
calculation only depends on numbers.length
and numbers[2]
. It doesn't depend on numbers[0]
or numbers[1]
, so when those indexes are changed, it doesn't need
to propagate dirtiness to the calc
calculation.
This means calculations need to keep track of the things that depend on it and which specific keys are needed.
And if calculations know about what depends on them, this also means the "emitter" and "consumer" nodes are unnecessary, which did that mapping.
This is a big refactor, and involves changing some core assumptions in how the graph processes items, especially in the presence of cycles.
I've been making (slow) progress on this, and hope the next version of Gooey includes this optimization, which should significantly improve the performance of small changes to large applications.