2013-06-09

I recently enjoyed a delightful evening talking with the NashFP crew about distributed systems. I’m very, very new to this space, having only become aware of NoSQL, Riak, etc. last fall as I learned more about Erlang, so what I don’t know would fill several libraries.

But I realized on my drive to Nashville that after just over four months now as a technical evangelist with Basho, I have been exposed to quite a bit more knowledge than I realized, so I discarded my original plans for the talk and instead held a dialog on some of the basics.

Here is a relatively brief recap of some of the ideas we discussed that night.

What’s a distributed system?

We didn’t explicitly talk about this, as far as I recall, and I suspect you can find many different definitions, but virtually every system is distributed in today’s world. One frequently-quoted aphorism attributed to Leslie Lamport works very well:

You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.

Distributed systems are at the nexus of most of the interesting conversations of our era: cloud and mobile computing, big data, social networks. This is hard stuff and most of the work has yet to be done, so if you’re still deciding what you want to do in the field, there are far worse ambulances to chase.

Leslie Lamport?

Yes, he’s the La in LaTeX, but more importantly, he’s been thinking and writing about concurrency and distributed systems since (at least) the early 1970s.

No, that’s not a typo. He’s been at this for a long time, and is a giant in the field.

CAP theorem

It’s almost impossible to not encounter the CAP theorem. One of the better introductions to the topic is Eric Brewer’s article CAP Twelve Years Later.

My short take on CAP: when the network breaks, the systems on each side of the partition cannot both remain available for updates and stay in complete agreement about the state of the world.

This theorem shapes many of the tradeoffs that consensus frameworks must explicitly deal with.
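To make that tradeoff concrete, here is a minimal sketch (in Python, with a made-up Replica class; this is not Riak or ZooKeeper code) of the two choices a system faces once two replicas can no longer reach each other:

```python
# Toy illustration of the CAP tradeoff during a partition.
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None

    def write(self, value):
        self.value = value

# Two replicas that can no longer talk to each other.
east, west = Replica("east"), Replica("west")

# Option 1: stay available -- accept writes on both sides...
east.write("cart: 1 item")
west.write("cart: 2 items")
# ...and give up consistency: the replicas now disagree.
assert east.value != west.value

# Option 2: stay consistent -- refuse writes when the peer is unreachable...
def write_if_reachable(replica, value, peer_reachable):
    if not peer_reachable:
        raise RuntimeError(f"{replica.name}: partition, refusing write")
    replica.write(value)

# ...and give up availability: this write fails during the partition.
try:
    write_if_reachable(west, "cart: 3 items", peer_reachable=False)
except RuntimeError as e:
    print(e)
```

Every real system lands somewhere on this spectrum; the interesting part is which side it chooses to sacrifice, and when.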

Aren’t networks highly reliable?

If you think this question is more academic than useful because networks are reliable (and even if you don’t), I cannot recommend highly enough an article by Kyle Kingsbury and Peter Bailis titled The network is reliable. Fair warning: be prepared for a long read.

Types of distributed systems

Highly available via eventual consistency

Eventual consistency is yet another concept that is widely discussed (and just as widely misunderstood).

Once again, the short version: given that CAP is an unforgiving mistress, given that many systems must remain available even when servers have failed or the network is partitioned, and given the massive, global scale of some of today’s most prominent network services (e.g., Facebook and Twitter), in many cases it’s perfectly acceptable for systems to be slightly out of agreement as long as they eventually converge on the same state.

The ubiquitous example of this is DNS. The Internet has been dealing with eventual consistency in DNS for longer than many readers have been alive.
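Here is a minimal sketch of what “converge eventually” can look like, using a last-write-wins merge. (Last-write-wins is just one simple conflict-resolution rule, chosen here for brevity; it is not a description of how Riak or DNS resolves conflicts, and the class and field names are invented for this example.)

```python
import time

class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.timestamp = 0.0

    def write(self, value):
        self.value = value
        self.timestamp = time.time()

    def merge(self, other):
        # Adopt whichever write happened last (last-write-wins).
        if other.timestamp > self.timestamp:
            self.value, self.timestamp = other.value, other.timestamp

# During a partition, each side keeps accepting writes...
a, b = Replica("a"), Replica("b")
a.write("profile v1")
time.sleep(0.01)
b.write("profile v2")

# ...and after the partition heals, an anti-entropy exchange converges them.
a.merge(b)
b.merge(a)
assert a.value == b.value == "profile v2"
```

The system stays available the whole time; the price is that reads taken before the merge may return stale or conflicting answers.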

Basho’s Riak Core is a distributed systems platform that falls in this space and about which I’ll be writing much more.

Strong consistency

At the other end of the spectrum are tools and protocols that attempt to achieve perfect consistency across entire clusters of servers, such as ZooKeeper, Raft, and Paxos.

There’s a cost to this: a quorum of servers must be reachable for the system to make progress, and the coordination required to achieve this level of consistency means slower responses to requests.
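The quorum rule itself is simple: a majority of the cluster must acknowledge a write before it counts. Here is a minimal sketch of that rule (the function names are invented for illustration and are not taken from ZooKeeper, Raft, or Paxos):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that constitutes a quorum."""
    return cluster_size // 2 + 1

def commit_write(acks: int, cluster_size: int) -> bool:
    # A write is committed only once a majority has acknowledged it;
    # with fewer reachable servers, the write must fail or stall.
    return acks >= majority(cluster_size)

print(majority(5))         # 3
print(commit_write(3, 5))  # True  -- quorum reached
print(commit_write(2, 5))  # False -- unavailable rather than inconsistent
```

Note what happens during a partition: the minority side can never assemble a quorum, so it stops accepting writes. That is the opposite choice from the eventually consistent systems above.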

The rest

Most distributed systems are neither highly available (leveraging eventual consistency) nor strongly consistent; instead, they’re ad hoc and buggy.

Even systems that take advantage of ZooKeeper or Riak Core will typically not be as strongly consistent as the former or as available as the latter, because they have custom glue code and external dependencies that aren’t as robust.

But wait, there’s more!

That’s probably the first 20 minutes of our conversation in Nashville. Expect more here in another post, but in the meantime, if you want to start diving in, the articles linked above are good places to begin.
