Practical State Machinery

As abstractions go, finite state machines represent a bit of low-hanging fruit when you have real-world problems to solve. The jargon can be a little forbidding (MDN leads with "a mathematical abstraction used to design algorithms" and pretty much gets more technical from there), but in reality they represent a simple and practical technique for structuring code. This post is aimed at working engineers in the web and mobile world who want to understand the practical uses of state machines as a tool for software design. To that end, we'll focus on concrete examples. The MDN link above provides a definition for the term "state machine;" we're going to use it in a sentence.

First question: What is a "state?" It's a condition something is in, where that thing is always in some condition out of a finite list at any given time. A car's transmission might be in park, reverse or drive but not (we hope!) more than one at once. In software, the thing in question might be a blog post that's in draft, published or deleted; it might be a user interface element that shows one pane on loading, one on success and another on error, with an option to retry; it might be code deep in the network stack, ensuring handshakes are executed correctly. Many things you will have to model in software have states of this kind, implicitly or explicitly.

"State machines" tell you, for a given thing, what state it should start in, and what the rules are for moving from state to state. For programmers, they're a basic and powerful tool for imposing order on chaos. The benefits of state machines often stem from their declarative nature:

Validity: Many bugs either never arise in the first place or turn into explicit errors with state machines. For example, an operation can be made to fail because it was called out of order, rather than being permitted to run with a nil value that it expects an earlier operation to have set.
Clarity: Machine definitions are relatively easy to visualize, both mentally and literally, using tools like Graphviz.
Extensibility: Machine definitions are also relatively easy to extend after the fact with new events and states. This is an important point because it's reasonable for developers to worry that a state machine represents too much of an upfront design commitment. In cases where the machine represents complex business logic, the upfront commitment may well pay off later, when extending (or fixing) that logic is a matter of updating or creating a few definitions, rather than untangling a bowl of spaghetti code.
Polymorphism: Software entities can, in general, dispatch to different implementations based on their current state, or even import entirely new interfaces. This can be tremendously useful, say, for conditional validation.

In terms of implementation, state machines can take many forms. They might be abstracted by a software package, or you might roll your own, or you might just refer to some chunk of code as a "state machine" because that's what it does. I'll give examples in JSON, as objects of the following shape:
initial: The state the machine starts in, given as a string.
transitions: An object specifying how the machine changes. The keys of this object are events, given as strings. The values are themselves objects, where each key is a state the machine might currently be in, and the value is the state that event would put the machine in next.

Here's an example that models a light switch:

{
  "initial": "off",
  "transitions": {
    "switch": { "off": "on", "on": "off" }
  }
}

This data structure, paired with some simple code to integrate it with your application, tells you pretty much everything you need to know about operating a light switch. It also provides a template for replacing (say) long blocks of conditional logic with something more declarative.
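The "simple code" in question can be as small as one function. What follows is a minimal sketch rather than a reference implementation: the name transition and the decision to throw on a disallowed event are assumptions made for illustration.

```javascript
// The light switch definition from above, as a JavaScript object.
const lightSwitch = {
  initial: "off",
  transitions: {
    switch: { off: "on", on: "off" },
  },
};

// Small utility so inherited properties are never mistaken for events or states.
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical helper: returns the next state, or throws if the event
// is not defined for the current state.
function transition(machine, currentState, event) {
  const { transitions } = machine;
  if (!has(transitions, event) || !has(transitions[event], currentState)) {
    throw new Error(`Event "${event}" is not allowed in state "${currentState}"`);
  }
  return transitions[event][currentState];
}

let state = lightSwitch.initial;                  // "off"
state = transition(lightSwitch, state, "switch"); // "on"
state = transition(lightSwitch, state, "switch"); // "off"
```

Throwing is one way to get the validity benefit described earlier: an out-of-order call becomes an explicit error instead of a silent no-op.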

Examples

Wizards

Multipage forms, sometimes called "wizards", are a common design pattern. They can also create headaches for web developers, as they require custom forms and logic based on the state of the user's interaction. Here's a typical and relatively simple checkout wizard you might use in an online store:

{
  "initial": "cart",
  "transitions": {
    "checkout": { "cart": "shipping" },
    "advance": {
      "shipping": "billing",
      "billing": "payment",
      "payment": "confirmation"
    },
    "confirm": { "confirmation": "accepted" },
    "cancel": {
      "shipping": "cart",
      "billing": "cart",
      "payment": "cart",
      "confirmation": "cart"
    }
  }
}

That is:

The initial state will be cart and will remain so while the user shops.
The checkout event can only occur while in state cart, and when that event happens, we move to shipping.
The user may advance through the form when it's in state shipping, billing, or payment.
From state confirmation alone, the user can confirm the order, at which point the payment will be run and the form will enter accepted, a terminal state (because there's no event that leads out of it to another state).

If the user has started the checkout process but hasn't finished it (i.e., if we're in any state other than cart or accepted), we can cancel the checkout and continue shopping.
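These rules can also be read off the definition mechanically, which is handy for things like deciding which buttons to show. Here is a sketch; the helper names availableEvents and isTerminal are made up for this post.

```javascript
// The checkout wizard definition, as a JavaScript object.
const checkout = {
  initial: "cart",
  transitions: {
    checkout: { cart: "shipping" },
    advance: { shipping: "billing", billing: "payment", payment: "confirmation" },
    confirm: { confirmation: "accepted" },
    cancel: { shipping: "cart", billing: "cart", payment: "cart", confirmation: "cart" },
  },
};

// Hypothetical helper: the events that are valid from a given state.
function availableEvents(machine, state) {
  return Object.keys(machine.transitions).filter((event) =>
    Object.prototype.hasOwnProperty.call(machine.transitions[event], state)
  );
}

// A state is terminal when no event leads out of it.
function isTerminal(machine, state) {
  return availableEvents(machine, state).length === 0;
}

availableEvents(checkout, "cart");    // ["checkout"]
availableEvents(checkout, "payment"); // ["advance", "cancel"]
isTerminal(checkout, "accepted");     // true
```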

Imagine, in your language of choice, implementing the above with conditional logic. The user form itself must be rendered based on the state of the order, as will client- and server-side validations and side effects of successful operations, like sending confirmation emails or reindexing customer data. That's four sites (rendering the form, running client-side validations, running server-side validations, dispatching backend side effects) where you'd potentially have if or case or switch statements to maintain. Imagine also that the business team will inevitably come asking for a new screen in the middle of the wizard, with its own form, validation, side effects and so on, so all of this logic should be easier to refactor, for your own health if nothing else.

Conditional logic to select a form component using JavaScript might look like this:

switch (currentPage) {
  case "billing":
    return <BillingForm />;
  case "shipping":
    return <ShippingForm />;
  // ...
}

This code breaks encapsulation by making two things its own responsibility: the iteration logic (the list of case statements) and the dispatch logic (the connection between, e.g., "billing" and BillingForm). Suppose again we have four places where this or something like it happens in the application. Because the iteration logic is foisted on the caller, that's four different lists of possible states. While one is updated, others might not be. From a design perspective it's simply not the caller's job to specify what the possible values are for the form, or what the connection is between those values and actual form components.

It would be great if we could abstract/centralize both of those things, maybe by writing a little function like this:

// like this:
getStateByName(currentPage).component
// or this:
getStateByName(currentPage).validations

This hypothetical function would take the name of the current page and return an object that provides what the caller really wants as simple keys. We'll call these "state objects." Callers don't have to iterate over the possible states of the form, and state-specific information, like what form component to use, can stay encapsulated:

/* states/billingState.js */

import { BillingForm } from "FormComponents";

const billingState = {
  pageName: "billing",
  component: BillingForm,
  validations: {
    field1: (value) => { /* ... */ },
    // ...
  },
};

export default billingState;

Now all you need to implement getStateByName is a registry of state objects, keyed by pageName.
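As a sketch of what that registry might look like, here is one way to do it with a Map. The placeholder components and the shippingState object are stand-ins invented for the example; in a real application each state object would live in its own module, as above.

```javascript
// Placeholder components; stand-ins for real form components.
const ShippingForm = () => "shipping form";
const BillingForm = () => "billing form";

// Hypothetical state objects, trimmed to the keys used here.
const shippingState = { pageName: "shipping", component: ShippingForm, validations: {} };
const billingState = { pageName: "billing", component: BillingForm, validations: {} };

// The registry: every state object, keyed by its pageName.
const registry = new Map(
  [shippingState, billingState].map((state) => [state.pageName, state])
);

function getStateByName(pageName) {
  const state = registry.get(pageName);
  if (!state) {
    throw new Error(`Unknown page: "${pageName}"`);
  }
  return state;
}

getStateByName("billing").component; // BillingForm
```

The list of states now lives in exactly one place, so adding a page means adding one state object to the registry rather than touching every switch statement.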

The next page in the form is also dependent on the current state; when you leave shipping you go to billing, whereas when you leave billing you go to payment. Once you've started the checkout process you can cancel it, but not after you've paid for it, i.e., if the state is accepted. This logic will be used by the server to update the state, and by validators to tell the user if what they're trying to do is permitted. Let's add a transitions key to our state object above to say what actions are possible, and what next state they will result in:

/* states/billingState.js */

const billingState = {
  // ...
  transitions: {
    advance: "payment",
    cancel: "cart",
  },
  // ...
};

export default billingState;

Congratulations: in the process of cleaning up conditional logic, you have essentially implemented a state machine. With the addition of an initial state (which you have already implemented elsewhere in your application, if the feature works at all), the transitions objects of each state can easily be combined and transformed into the format above.
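To make "combined and transformed" concrete, here is a rough sketch of that conversion. The helper name toMachine is hypothetical, and apart from billing the per-state transitions are filled in by hand to match the wizard definition from earlier.

```javascript
// State objects, reduced to the two keys that matter for this conversion.
const states = [
  { pageName: "cart", transitions: { checkout: "shipping" } },
  { pageName: "shipping", transitions: { advance: "billing", cancel: "cart" } },
  { pageName: "billing", transitions: { advance: "payment", cancel: "cart" } },
  { pageName: "payment", transitions: { advance: "confirmation", cancel: "cart" } },
  { pageName: "confirmation", transitions: { confirm: "accepted", cancel: "cart" } },
  { pageName: "accepted", transitions: {} },
];

// Hypothetical helper: turn { state -> { event -> next } }
// into { event -> { state -> next } } and attach the initial state.
function toMachine(initial, stateObjects) {
  const transitions = {};
  for (const { pageName, transitions: events } of stateObjects) {
    for (const [event, nextState] of Object.entries(events)) {
      transitions[event] = transitions[event] || {};
      transitions[event][pageName] = nextState;
    }
  }
  return { initial, transitions };
}

const machine = toMachine("cart", states);
// machine.transitions.advance is
// { shipping: "billing", billing: "payment", payment: "confirmation" }
```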

The advantage of the format above, and of thinking in terms of state machines from the jump, is that it gives you an efficient, high-level representation of things you were probably going to do anyway, one way or another.

Going Straight to the Cloud

In the bad old days file upload was just a particularly wacky case of form submission, but not any more. As more applications make use of object storage services, it often makes sense for the user to upload the file directly to the object storage provider itself. For example, this saves the outbound bandwidth cost of transferring the file to object storage from your application servers. The question is, how do you track this in your application, now that the application servers never see the file?

With help from the client, the application servers can track the state of the upload:

{
  "initial": "pending",
  "transitions": {
     "begin": { "pending": "in progress" },
     "reject": { "pending": "rejected" },
     "done": { "in progress": "uploaded" },
     "fail": { "in progress": "failed" },
     "replace": { "uploaded": "in progress" }
   }
}

The user clicks a button to add a new file, triggering a call to the server for a new upload record. The server creates the record in state pending and replies with a pre-signed URL in the object storage domain for the client to upload the file to. The client begins sending the file to the pre-signed URL, and sends the application server a begin event, which updates the upload record. When the upload is complete, the client sends done. If the pre-signed URL is rejected, the client can send a reject event, or if the upload fails for some reason, it can send fail. The client can replace a file that's been successfully uploaded, but not one that's failed, because there is nothing in place to replace. State machines are a serializable, language-agnostic way for clients and servers to handle complex tasks over stateless media like HTTP.
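On the server, handling those client events might look something like the sketch below. It uses an in-memory Map as a stand-in for the uploads table and leaves out generating the pre-signed URL entirely; createUpload and applyEvent are hypothetical names.

```javascript
const uploadMachine = {
  initial: "pending",
  transitions: {
    begin: { pending: "in progress" },
    reject: { pending: "rejected" },
    done: { "in progress": "uploaded" },
    fail: { "in progress": "failed" },
    replace: { uploaded: "in progress" },
  },
};

// Stand-in for a database table of upload records.
const uploads = new Map();

function createUpload(id) {
  const record = { id, state: uploadMachine.initial };
  uploads.set(id, record);
  return record;
}

// Apply a client-reported event. If the event is not allowed from the
// record's current state, report that and leave the record unchanged.
function applyEvent(id, event) {
  const record = uploads.get(id);
  if (!record) return { ok: false, error: "unknown upload" };
  const byState = uploadMachine.transitions[event];
  const next = byState && byState[record.state];
  if (typeof next !== "string") {
    return { ok: false, error: `"${event}" is not allowed in state "${record.state}"` };
  }
  record.state = next;
  return { ok: true, state: next };
}

createUpload("12345");
applyEvent("12345", "begin");   // { ok: true, state: "in progress" }
applyEvent("12345", "fail");    // { ok: true, state: "failed" }
applyEvent("12345", "replace"); // ok is false: a failed upload can't be replaced
```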

As in the previous case, the state machine itself is basically just a very compact representation of the business logic. That is: if you implement the business logic, there's a good chance you've implemented some kind of state machine, even if only implicitly. Using state machines by design makes the logic clear, and provides you with a consistent interface across your application.

Automating workflows

Many applications also require substantial amounts of automation. A CI platform might have to respond to code pushes by building a Docker image, scheduling a container and running an automated test suite, for example, and every step will require customized error handling. State machines are ideal for modeling such job-based workflows.

Let's extend the second example to handle background work you might need to perform on an image upload:

{
  "initial": "pending",
  "transitions": {
     "begin": { "pending": "in progress" },
     "reject": { "pending": "rejected" },
     "fail": { "in progress": "failed" },
     "done": { "in progress": "compression queue" },
     "compress complete": { "compression queue": "thumbnail queue" },
     "thumbnails complete": { "thumbnail queue": "ready" },
     "replace": { "ready": "in progress" }
   }
}

In this case, when the client reports done, the server will set the state to compression queue. Let's suppose that state lives in a cell in a database row, and that database is Postgres. Postgres has a feature called NOTIFY that can be used together with trigger functions to emit notifications on table updates. The NOTIFY command takes a channel and an optional payload. Let's suppose further that we've configured a trigger function to emit a message to channel upload-<upload id> whenever an upload record's state changes, with the new state as the payload.

Next, either using an off-the-shelf solution or your own concoction, get your application listening to Postgres notifications. Here's an example under Phoenix/Elixir from Kamil Lelonek that I found useful the first time I implemented this pattern. The idea is to associate certain states (say, compression queue and thumbnail queue) with job queues in your application. When your application receives a message to upload-12345 with payload compression queue, it knows that Upload 12345 is ready to be shrunk from the massive TIFF sent by the user to something more manageable, and it can dispatch a worker to do so. When the worker is done, it sends the compress complete event; the code integrating the state machine with your application updates the state to thumbnail queue, and the process repeats with the thumbnail generation queue. Finally, when the record enters the ready state, a notification is sent to clients that a new upload is in place, and no further processing occurs. (While the workers will have to pull the file from object storage, you can generally do so on nodes in the same infrastructure as your object storage and avoid being charged outbound bandwidth. Your main application servers can still be wherever you want, because the worker nodes need only communicate messages like compress complete back to your database, the bandwidth for which is trivial compared to slinging the file back and forth.)
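The dispatch step itself can be quite small. The sketch below assumes the notification has already been received and parsed into a channel and a payload, and that the payload is the upload's new state, as configured above. The in-memory queues and the name handleNotification are invented for illustration; listening to Postgres is left out.

```javascript
// Hypothetical in-memory job queues, keyed by the state that feeds them.
const queuesByState = new Map([
  ["compression queue", []],
  ["thumbnail queue", []],
]);

// Called once per notification. `channel` looks like "upload-12345" and
// `payload` is the upload's new state. Returns true if a job was enqueued.
function handleNotification({ channel, payload }) {
  const match = /^upload-(.+)$/.exec(channel);
  if (!match) return false; // not an upload channel

  const queue = queuesByState.get(payload);
  if (!queue) return false; // no background work tied to this state

  queue.push({ uploadId: match[1] });
  return true;
}

handleNotification({ channel: "upload-12345", payload: "compression queue" }); // true
handleNotification({ channel: "upload-12345", payload: "ready" });             // false
```

States with no queue attached, like ready, simply fall through, which is how the workflow comes to a stop.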

If state changes trigger workers, and workers can make state changes, you can build automated workflows of arbitrary complexity. It bears mentioning that these workflows include error handling. The state machine can, for example, permit a disk full error event when the image is being processed, but not when it's in state ready, when its disk usage doesn't change. It also bears mentioning that workers can link different state machines across the application. On a disk full error, the image worker might respond by putting another record governed by a different state machine into state purge to call a worker to clear space.
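Here is roughly what the disk full case could look like as an addition to the upload machine. The source state list follows the description above, but the destination state failed is my assumption, as is the helper name canHandle.

```javascript
const imageMachine = {
  initial: "pending",
  transitions: {
    begin: { pending: "in progress" },
    reject: { pending: "rejected" },
    fail: { "in progress": "failed" },
    done: { "in progress": "compression queue" },
    "compress complete": { "compression queue": "thumbnail queue" },
    "thumbnails complete": { "thumbnail queue": "ready" },
    replace: { ready: "in progress" },
    // Assumption: running out of disk during processing lands in "failed".
    // There is deliberately no entry for "ready".
    "disk full": { "compression queue": "failed", "thumbnail queue": "failed" },
  },
};

// Hypothetical helper: is this event defined for this state?
function canHandle(machine, state, event) {
  const byState = machine.transitions[event];
  return Boolean(byState) && Object.prototype.hasOwnProperty.call(byState, state);
}

canHandle(imageMachine, "thumbnail queue", "disk full"); // true
canHandle(imageMachine, "ready", "disk full");           // false
```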

When do you not want to use this?

We said above that the kind of "state" we're interested in here is one where an object is in exactly one state at a time, drawn from a finite list. What does that rule out? Where wouldn't you see that?

The object has relevant state along more than one dimension: A symptom of this is condensing multiple real-world concepts into a single state field. Baseball games can happen at home or on the road, day or night, during any month of the season, etc. If you started modeling that and wound up with states like june day home game, you may be trying to do too much with a single state attribute. On the other hand, if the dimensions are basically independent of each other, it may work to associate multiple state machines with a single entity. Returning to car transmissions as above, the transmission might be in park, drive, etc. along one dimension, and brand new or shopworn along another.
The object doesn't have any relevant state at all: A basic web page with no drafting functionality probably doesn't call for a state machine. It has state, i.e., its contents, but doesn't really move through a finite series of specific states; the usual CRUD model is sufficient to capture its dynamics.
The changes in question are quantitative, rather than qualitative: If the state in question comes in the form of a number, it might not be a great candidate for this abstraction. For example, you could consider every value along the Celsius scale a state, and a panoply of events like increase_by_one_degree, increase_by_two_degrees, etc., but would that really be a good use of your time? (If on the other hand you're ultimately tracking temperature because you want to know if a given piece of matter is solid, liquid or gaseous, then you have a good candidate for a state machine operated by the value of another column.)
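For that last case, the qualitative state can be derived from the number rather than modeled degree by degree. The sketch below assumes water at standard pressure and uses simplified thresholds; phaseOf and updateTemperature are hypothetical names.

```javascript
// Assumption: the substance is water at standard pressure. The boundaries
// are a simplification (exactly 0 counts as solid, exactly 100 as gas).
function phaseOf(celsius) {
  if (celsius <= 0) return "solid";
  if (celsius < 100) return "liquid";
  return "gas";
}

// Hypothetical record with a numeric column and a derived state column.
function updateTemperature(record, celsius) {
  return { ...record, celsius, phase: phaseOf(celsius) };
}

let sample = { id: 1, celsius: 20, phase: "liquid" };
sample = updateTemperature(sample, 21);  // still "liquid": no state change
sample = updateTemperature(sample, 101); // now "gas"
```

The number changes constantly, but the state only changes three ways, and that small set is the part worth handing to a state machine.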

Conclusion

State machines are an essential tool for managing complex behavior, and they are a mercifully simple one to integrate into your practice. The trick is knowing when to use them. There are almost certainly one or more state machine packages for your language of choice, so go see which ones people are using. We've linked to a few below. That, or create your own from scratch and tailor the integration to your application.


The Gnar is a fire-breathing, Boston-based software partner made of problem-solvers.
