Creating Process Continuity in Distributed Elixir

Ryan Spore

December 12, 2022

Kohort is an opinionated remote meeting platform we built recently and use daily at Mechanical Orchard. It lets distributed teams have meetings with more structure, transparency, and connection, with an integrated meeting agenda, live transcription, AI summarization, and much more.

Whenever you join a meeting via Kohort, we spin up an Elixir process that will live for the length of the meeting. It does a variety of things, from sampling the transcript for keywords, to orchestrating API calls as the meeting starts and ends. We need this process to have continuity throughout the length of a meeting (and meetings can last a long time amirite?). But machines die, deploys happen, and without care, our Meeting process continuity is lost.

If you also have long-running processes, you will face the same problem: what happens to a long-lived process when the machine it’s running on dies?

We really wanted the answer to that question to be “A new process starts immediately on another machine, picks up from where the dead process stopped, and everyone else starts talking to the new process without even noticing the difference.”

We got pretty close! I’ll tell you how.

Basic Setup

We started with Horde. Horde provides distributed DynamicSupervisor and Registry implementations, which we'll use to supervise our long-running processes and maintain their uniqueness.

	defmodule MyApp.Application do
	  use Application

	  def start(_type, _args) do
	    children = [
	      # ... other processes in your application
	      {Horde.Registry, [name: MyApp.DistributedRegistry, keys: :unique, members: :auto]},
	      {Horde.DynamicSupervisor,
	       [
	         name: MyApp.DistributedSupervisor,
	         strategy: :one_for_one,
	         shutdown: 10_000,
	         members: :auto
	       ]}
	      # ... other processes in your application
	    ]

	    # ... other startup work
	    # typical top-level supervisor options (assumed; adjust to your application)
	    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
	    Supervisor.start_link(children, opts)
	  end

	  # ... other application logic
	end


Let’s start with a GenServer skeleton for a long-running process, something like this:

	defmodule MyApp.Process do
	  use GenServer, restart: :transient

	  def start_link(name) do
	    case GenServer.start_link(__MODULE__, name, name: via_tuple(name)) do
	      {:ok, pid} -> {:ok, pid}
	      # a process is already registered under this name; don't start a duplicate
	      {:error, {:already_started, _pid}} -> :ignore
	    end
	  end

	  # Name under which the process registers in the distributed registry
	  def via_tuple(name), do: {:via, Horde.Registry, {MyApp.DistributedRegistry, name}}

	  @impl GenServer
	  def init(name), do: {:ok, name, {:continue, :post_init}}

	  @impl GenServer
	  def handle_continue(:post_init, name) do
	    state = initial_state(name)
	    {:noreply, state}
	  end

	  @impl GenServer
	  def handle_call({:trigger_manual_shutdown, shutdown_signal}, _from, state) do
	    {:stop, shutdown_signal, :ok, state}
	  end

	  @impl GenServer
	  def terminate(_reason, _state), do: :ok

	  # Placeholder: later snippets read state.name, so state is a map carrying the name
	  defp initial_state(name), do: %{name: name}

	  # ... the work your process does ...
	end


It’s mostly a basic structure for a GenServer, but we’re using a :continue-d startup, and we’re registering the GenServer with our DistributedRegistry.

You can start it with our Horde supervisor:

	Horde.DynamicSupervisor.start_child(MyApp.DistributedSupervisor, {MyApp.Process, name})


Cluster Setup

To have meaningful handoff, you need your application to operate as a cluster. We use libcluster's Kubernetes strategies to get our BEAM nodes talking to each other (see this blog post for an excellent guide on how to get those to play nicely).

Once you’ve got the application clustering, you can set up Horde to supervise your long-lived processes.

In this state, if the node your process is running on died, Horde would restart it on a different node, but with the same starting state that was provided in the start_child call.

Graceful State Handoff

In order to transfer state to the new node, we need to handle the exit signals Horde uses when a node dies. First, we need to trap exits:

	def init(name) do
	  Process.flag(:trap_exit, true)
	  {:ok, name, {:continue, :post_init}}
	end


Next, in terminate, we intercept the normal shutdown signal triggered by the node shutting down, write some handoff state to the Horde.Registry meta, and then sleep to give the Registry a chance to sync with other nodes.
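
A minimal sketch of such a terminate clause might look like the following. To keep it runnable on a single node without dependencies, Elixir's built-in Registry stands in for Horde.Registry (both expose put_meta/3 and meta/2); in a cluster you would call Horde.Registry with MyApp.DistributedRegistry instead. The module name HandoffSketch, the notes field, the {:handoff, name} meta key, and the 500 ms pause are all assumptions made for illustration. The tuple key is there because the built-in Registry only accepts atoms or tuples as meta keys.

```elixir
defmodule HandoffSketch do
  use GenServer

  # Hypothetical names and values, chosen for this sketch only.
  @registry HandoffSketch.Registry
  @sync_pause_ms 500

  @impl GenServer
  def init(name) do
    # Trapping exits is what lets terminate/2 run on a supervisor shutdown.
    Process.flag(:trap_exit, true)
    {:ok, %{name: name, notes: []}}
  end

  @impl GenServer
  def handle_cast({:note, note}, state) do
    {:noreply, %{state | notes: [note | state.notes]}}
  end

  @impl GenServer
  def terminate(:shutdown, state) do
    # Tag the state so the next incarnation knows it can be trusted as-is.
    Registry.put_meta(@registry, {:handoff, state.name}, {:graceful_handoff, state})
    # Locally this pause does nothing useful; with a CRDT-backed registry
    # it buys time for the write to reach other nodes.
    Process.sleep(@sync_pause_ms)
    :ok
  end

  def terminate(_reason, _state), do: :ok
end

{:ok, _registry} = Registry.start_link(keys: :unique, name: HandoffSketch.Registry)
# GenServer.start/2 (no link) keeps the :shutdown exit from taking this script down.
{:ok, pid} = GenServer.start(HandoffSketch, "standup")
GenServer.cast(pid, {:note, "agenda item 1"})
:ok = GenServer.stop(pid, :shutdown)

# What a replacement process would find when it starts up.
handoff = Registry.meta(HandoffSketch.Registry, {:handoff, "standup"})
```

How long to sleep is a judgment call: long enough for the registry to replicate, short enough to finish inside the supervisor's shutdown timeout.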

Your application needs some shared datastore that both dying and spawning nodes can communicate with. This could be a database or other external datastore, but we chose the Horde.Registry meta, which is kept synchronized across the cluster using a CRDT. To use this effectively, our nodes had to stay clustered until they shut down completely, so the handoff could happen, while being removed from the load balancer so users didn't connect to the dying nodes. Using the pods lookup mode, we were able to keep pods clustered during their shutdown, after they had already been removed from the endpoint managing load balancing.
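
For reference, a libcluster topology using the pods lookup mode could be configured roughly as below. This is a sketch, not our exact configuration: the topology name k8s, the "myapp" basename, the selector, the namespace, and the polling interval are placeholders to replace with your own values.

```elixir
# config/runtime.exs
import Config

config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        # Look nodes up from the pod list instead of the endpoints object
        kubernetes_ip_lookup_mode: :pods,
        # Placeholder values below
        kubernetes_node_basename: "myapp",
        kubernetes_selector: "app=myapp",
        kubernetes_namespace: "default",
        polling_interval: 10_000
      ]
    ]
  ]
```

The topologies still need to be handed to a Cluster.Supervisor in your application's supervision tree, and the pod's service account needs permission to list pods in that namespace.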

If your long-running process isn’t eternal, you need an alternate way to shut down the process that clears the handoff state, like this:

	def terminate({:shutdown, :finished}, state) do
	  # clear the handoff state so a later start under this name begins fresh
	  Horde.Registry.put_meta(MyApp.DistributedRegistry, state.name, nil)
	  :ok
	end


Clumsy State Handoff

Graceful shutdown isn’t guaranteed, so to hedge against a clumsier handoff, we eagerly write as much state as is available and add a more complex state recovery step. You can try reconstituting state from a database or event log.

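	def handle_continue(:post_init, name) do
	  state =
	    case Horde.Registry.meta(MyApp.DistributedRegistry, name) do
	      # the previous process shut down cleanly; its state can be used as-is
	      {:ok, {:graceful_handoff, state}} ->
	        state

	      # the previous process died abruptly; the stored state may be stale
	      {:ok, {:clumsy_handoff, state}} ->
	        attempt_state_recovery(state)

	      # no handoff state (or it was cleared): start from scratch
	      _ ->
	        initial_state(name)
	    end

	  # eagerly record what we have in case this process dies without warning
	  Horde.Registry.put_meta(MyApp.DistributedRegistry, state.name, {:clumsy_handoff, state})
	  {:noreply, state}
	end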


With all this in place, you have a system that moves long-running processes off dying nodes and uses the best available information to keep them running from where they left off.

Further Improvements

Horde's distribution is eventually consistent, so sometimes duplicate processes will get started. You can look into merge conflict resolution to handle the messages Horde uses to resolve these conflicts.
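
As a starting point, here is a sketch of a process that steps aside when it loses a name conflict. The message shape follows Horde's documented name-conflict exit signal, {:name_conflict, {key, value}, registry, winning_pid}, which arrives as an :EXIT message once exits are trapped; treat it as an assumption and check it against the Horde version you run. ConflictSketch is a hypothetical module, and the conflict is simulated with a plain send so the example runs without a cluster.

```elixir
defmodule ConflictSketch do
  use GenServer

  @impl GenServer
  def init(name) do
    Process.flag(:trap_exit, true)
    {:ok, %{name: name}}
  end

  @impl GenServer
  def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _winner}}, state) do
    # This copy lost the conflict and the winner keeps the name. Stopping with
    # :normal means a :transient child is not restarted by its supervisor.
    {:stop, :normal, state}
  end

  def handle_info(_other, state), do: {:noreply, state}
end

{:ok, pid} = GenServer.start(ConflictSketch, "standup")
ref = Process.monitor(pid)

# Simulated conflict signal; in a cluster the registry would send this.
send(pid, {:EXIT, self(), {:name_conflict, {"standup", nil}, :some_registry, self()}})

exit_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end
```

Before stopping, the losing process could also forward any state the winner lacks, but that depends entirely on what your process holds.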

You can also enhance your clumsy handoff recovery by adding periodic writes to the clumsy handoff state or improving the state recovery logic when a new node attempts a clumsy recovery.
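
Periodic writes can be sketched with a timer message the process sends to itself. As a stand-in that runs on one node, this example uses Elixir's built-in Registry rather than Horde.Registry; CheckpointSketch, the ticks counter, the {:handoff, name} key, and the 200 ms interval are illustrative assumptions.

```elixir
defmodule CheckpointSketch do
  use GenServer

  # Hypothetical names and values, chosen for this sketch only.
  @registry CheckpointSketch.Registry
  @checkpoint_every_ms 200

  @impl GenServer
  def init(name) do
    schedule_checkpoint()
    {:ok, %{name: name, ticks: 0}}
  end

  @impl GenServer
  def handle_cast(:tick, state), do: {:noreply, %{state | ticks: state.ticks + 1}}

  @impl GenServer
  def handle_info(:checkpoint, state) do
    # Snapshot the current state, then arm the timer for the next snapshot.
    Registry.put_meta(@registry, {:handoff, state.name}, {:clumsy_handoff, state})
    schedule_checkpoint()
    {:noreply, state}
  end

  defp schedule_checkpoint do
    Process.send_after(self(), :checkpoint, @checkpoint_every_ms)
  end
end

{:ok, _registry} = Registry.start_link(keys: :unique, name: CheckpointSketch.Registry)
{:ok, pid} = GenServer.start(CheckpointSketch, "standup")
GenServer.cast(pid, :tick)
Process.sleep(500)

# Simulate an abrupt death: terminate/2 never runs, yet a snapshot survives.
Process.exit(pid, :kill)
snapshot = Registry.meta(CheckpointSketch.Registry, {:handoff, "standup"})
```

The interval trades write volume against how much work is lost after a crash, so it is worth tuning to how quickly your state changes.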

Elixir has proved to be an excellent (and enjoyable!) tool for building a modern, resilient, scalable RTC application. Phoenix and LiveView allow us to build a responsive app quickly, and OTP allows us to build a scalable and fault-tolerant video call experience. It’s been a pleasure working in this ecosystem to deliver an innovative product that’s always evolving and improving.
