XKCD Explained

How have I not heard of this blog before now? I’ll go with: “Well, because you’re so super-smart, you got every joke and never had to have XKCD explained to you.” Yeah… right.

11 August 2011

Netflix Has A Chaos Monkey… and it is Awesome.

The Chaos Monkey has one job: to run around the Netflix servers and services and create chaos by randomly killing services within what Netflix calls their Rambo architecture.

Why would a company create chaos within their own working network? Netflix needs each system to expect and tolerate failure from other systems upon which the first system depends. In other words, using the Rambo metaphor, each system needs to be able to succeed, no matter what, even when all alone, in the jungle (okay, no jungle).
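To make the idea concrete, here is a toy sketch of what "randomly kill a service" could look like. This is not Netflix's code: the `chaos_monkey` function and the service names are made up for illustration, and a real tool would terminate actual instances rather than flip a flag in a dictionary.

```python
import random

def chaos_monkey(services, rng=random):
    """Pick one running service at random and mark it as down.

    `services` is a hypothetical dict mapping service name -> True (up) / False (down).
    Returns the name of the service that was killed, or None if nothing was running.
    """
    running = [name for name, is_up in services.items() if is_up]
    if not running:
        return None
    victim = rng.choice(running)
    services[victim] = False  # simulate the failure
    return victim

# Illustrative service names only.
services = {"recommendations": True, "search": True, "streaming": True}
killed = chaos_monkey(services)
print(f"Chaos Monkey killed: {killed}")
```

The point is not the killing itself but what happens afterwards: every service still marked as up has to keep doing its job without the one that just vanished.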

For example, if the recommendations system goes down, the entire website should still work, showing popular titles rather than personalized picks. Likewise, if search becomes slow and bogged down with queries, the streaming systems should still operate perfectly.
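The recommendations example boils down to a fallback: try the personalized path, and if it fails, serve something generic instead of an error page. Here is a minimal sketch under that assumption. The function names, the `service_up` flag, and the placeholder titles are all hypothetical and only stand in for a real service call.

```python
# Placeholder data; a real site would pull this from a cache or a static list.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]

class RecommendationsDown(Exception):
    """Raised when the (hypothetical) recommendations service is unavailable."""

def fetch_personalized(user_id, service_up=True):
    # Stand-in for a network call to a recommendations service.
    if not service_up:
        raise RecommendationsDown("recommendations service unavailable")
    return [f"Pick for {user_id} #{i}" for i in range(1, 4)]

def titles_for_homepage(user_id, service_up=True):
    """Return personalized picks, falling back to popular titles on failure."""
    try:
        return fetch_personalized(user_id, service_up)
    except RecommendationsDown:
        # Degrade gracefully: the page still renders, just less personalized.
        return list(POPULAR_TITLES)

print(titles_for_homepage("jason"))                    # personalized picks
print(titles_for_homepage("jason", service_up=False))  # popular titles
```

The caller never sees the failure. The homepage gets a list of titles either way, which is the behavior the Chaos Monkey is there to keep honest.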

Their mantra: “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most - in the event of an unexpected outage.”

Netflix really nailed it with their preparation this past weekend. The massive Amazon EC2 server outage took down many websites, including Formspring, Hootsuite, Reddit, Foursquare, Quora, Ow.ly, Zynga and about.me.

Depending on the robustness of each architecture, each website went down to a greater or lesser degree. Some went down altogether; others (like Reddit) were able to keep serving their information, but commenting and social functions stopped working.

Despite relying on Amazon EC2, Netflix’s architecture hardly seemed to miss a beat. And that’s because they have a Monkey running around causing chaos.

26 April 2011

About Me

For my professional blog, go to:

Jason Ishibashi
