3 lessons learned from implementing Raft in Go

As part of my continuous learning journey, I recently picked up a Distributed Systems class from MIT that uses Golang in its labs. The course is great: all the lecture recordings and supporting materials are available online and are summarised in the syllabus. I especially liked the idea that I could learn something new about distributed systems, fine-tune my Golang skills and build something cool, all at the same time, by working through the take-home assignments.

I wasn't disappointed. After taking on a simple MapReduce implementation, I was ready to tackle the Raft implementation by following the design outlined in the extended white paper.

Delving 1 level deeper

Since I was implementing Raft from scratch, I had to build directly on top of the Go language itself, without any recourse to external libraries or snippets of code. Of course, this means that I developed a pretty close relationship with basic building blocks such as the sync and net/rpc packages. Often, to really understand what I was doing, I needed to peek into the Go source code itself, an activity which I had dismissed as "unnecessary" many times before.

One of the advantages of Go is that it's a small language: a beginner with a basic knowledge of its syntax can get pretty far before getting confused. On top of that, the standard packages of Go are really lovely and natively provide many of the goodies (networking, unit testing, logging) that other languages provide only through 3rd party packages. Hence, peeking into libraries like sync is not as intimidating as it may seem, because everything is built from a limited set of logical elements.

Developing knowledge of what's happening one or two levels below my own code has given me a new level of confidence in what I was doing, especially when the documentation mentioned some details just in passing and left my imagination running wild over the possible low-level behavior of a certain library. Interestingly, I got into the habit of exploring and delving into totally different languages and codebases for the pleasure of discovery: in this way, I was able to tackle some bugs that previously looked pretty esoteric.

Concurrency in Go

90% of my committed lines of code are written in Golang. It's easily my main language, used both at work and for leisure coding. However, the way I use the language in a typical project usually revolves around:

  • designing and implementing RPC APIs
  • coding business logic
  • defining suites of unit and integration tests
  • extending some CLI utility tool

The above activities are all nice and sweet, but they undeniably belong to a type of programming that sits at a pretty high level of abstraction, leaving many interesting tech problems at the doorstep.

Implementing my own version of Raft allowed me to dive head-first into concurrency without restraints. I already knew about goroutines and had used them a little in the past, but never to the point where I had to think hard about them. Detecting candidates for concurrency is not the difficult bit, as it's usually obvious that network operations and long-running background tasks deserve their own goroutine. The challenge is to correctly manage concurrent data access from the various goroutines and figure out how they could be interleaved during execution.

Locking critical code sections is the job of the standard sync package. I chose some of the most basic, but effective, tools available: Mutex and Cond from sync, plus the built-in type chan.

Mutex was my main choice when protecting access to critical data, such as the Raft node's internal state. I adopted a very coarse granularity when locking code, releasing the lock only before initiating I/O operations or explicitly pausing the goroutine. It worked pretty well to avoid deadlocks and release resources for other goroutines, even if theoretically some more performance could have been unlocked with a finer management of data access.
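
To make the pattern concrete, here is a minimal sketch of that coarse locking style. It is not my lab code: the Node type, its fields, the heartbeat method and the send callback are all hypothetical names made up for illustration, and the "RPC" is just a function that sleeps.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Node is a hypothetical, stripped-down Raft node (illustration only).
type Node struct {
	mu          sync.Mutex // one coarse lock guarding all fields below
	currentTerm int
	votedFor    int // -1 means "no vote cast in this term"
}

// heartbeat copies the state it needs while holding the lock, releases the
// lock for the duration of the (simulated) network call, then takes the lock
// again to apply the reply. send is an assumed stand-in for a real RPC.
func (n *Node) heartbeat(send func(term int) int) {
	n.mu.Lock()
	term := n.currentTerm
	n.mu.Unlock() // never hold the lock across I/O

	replyTerm := send(term)

	n.mu.Lock()
	defer n.mu.Unlock()
	// State may have changed while the lock was released, so re-check it.
	if replyTerm > n.currentTerm {
		n.currentTerm = replyTerm
		n.votedFor = -1
	}
}

func main() {
	n := &Node{currentTerm: 1, votedFor: 0}
	// Fake peer that is one term ahead of us and takes a while to answer.
	n.heartbeat(func(term int) int {
		time.Sleep(10 * time.Millisecond)
		return term + 1
	})
	fmt.Println(n.currentTerm, n.votedFor)
}
```

The important detail is the re-check after the second Lock: anything read before the I/O call may be stale by the time the reply arrives.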

Cond was an interesting new finding: instead of waiting for a fixed number of goroutines in a WaitGroup (which may or may not bring the system to the desired state), I could have every goroutine report back to the main one, which wakes up on each report and proceeds immediately as soon as any of them has returned the data it was waiting for. Before learning about Cond, I achieved the same goal by inserting additional checks and logic in each goroutine, which is functionally similar but harder to read and understand in my opinion.
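
As an illustration, this is roughly how a vote count can be built around sync.Cond. Treat it as a sketch under my own assumptions rather than the course's reference solution: collectVotes and the ask callback are hypothetical names, and ask stands in for a real RequestVote RPC.

```go
package main

import (
	"fmt"
	"sync"
)

// collectVotes asks every peer for a vote in its own goroutine and returns as
// soon as a majority is reached, or once all peers have answered.
// ask is an assumed stand-in for the real RPC call.
func collectVotes(peers int, ask func(peer int) bool) bool {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)

	granted := 1 // the candidate votes for itself
	finished := 0
	majority := (peers+1)/2 + 1

	for p := 0; p < peers; p++ {
		go func(peer int) {
			vote := ask(peer) // I/O happens without holding the lock

			mu.Lock()
			defer mu.Unlock()
			if vote {
				granted++
			}
			finished++
			cond.Broadcast() // wake the waiter so it can re-check the counters
		}(p)
	}

	mu.Lock()
	defer mu.Unlock()
	// Wait releases mu while sleeping and re-acquires it before returning,
	// so the condition is always checked under the lock.
	for granted < majority && finished < peers {
		cond.Wait()
	}
	return granted >= majority
}

func main() {
	// Five-node cluster: the candidate plus four peers, two of which say yes.
	won := collectVotes(4, func(peer int) bool { return peer < 2 })
	fmt.Println("elected:", won)
}
```

Compared with a WaitGroup, the waiter does not have to sit through the slowest peer: it stops waiting the moment the counters say the election is decided.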

Building from a white paper

Building directly from the white paper definitely felt different. There is great pleasure in going directly to the root source of knowledge, avoiding silly 10-minute YouTube videos or bloated articles that keep beating around the bush without ever getting to the core concepts. Additionally, a white paper like the Raft one not only explains the high-level design but also goes on to explain why some features were designed in a certain way.

Knowing the reason for things was really important to me, as it allowed me to make appropriate decisions while keeping some degree of freedom in taking whatever smart (or dumb) decision I felt like taking at any time.

As a nice added bonus, Raft was purposefully designed to be an easy-to-understand consensus and coordination algorithm for distributed systems. While it's not necessarily the algorithm you would pick for production because of its sub-optimal performance, it deals with crucial concepts and challenges that all distributed systems face, providing a solid foundation that I later used to understand and interpret decisions made in other, more complex algorithms. This is very difficult to do by following a second-hand explanation of the algorithm: I really understood Raft only after reading the extended white paper and implementing it myself.

Conclusion

While challenging, and borderline bewildering at times, implementing Raft was a great experience overall that added many practical tools to my belt and improved my proficiency with Go.

Yes, undertaking this project was challenging and time-consuming, even with the hints provided and after re-reading the documentation multiple times. Still, the final satisfaction of seeing all the tests pass is immense. Moreover, I feel that the new knowledge I acquired is already spilling over into other projects and even into work-related stuff, so the time I invested in this is already paying me back with interest.