Gradual Epiphany

1 Nov 2009

Thoughts on Dynamo’s “flawed architecture”

Posted by dizzyd

In general, I think it's a little inflammatory to make sweeping statements about the fitness of a given architecture. Every architecture has its flaws; it's an expected state when you are faced with diametrically opposing constraints. The real question that should be asked is whether or not an architecture solves the problems for which it was designed in a reliable and efficient manner.

Joydeep Sarma posted an entry claiming that Dynamo is a "flawed architecture". I'm not really qualified to prove or disprove Mr. Sarma's claim, but having implemented a Dynamo clone, I think that he may be a little confused about how things work in these systems. What follows are a few quotes from his write-up followed by my own responses.

Let’s say that one is storing key-value pairs in Dynamo - where the value encodes a ‘list’. If
Dynamo returns a stale read for a key and claims the key is missing, the application will
create a new empty list and store it back in Dynamo. This will cause the existing key to be
wiped out. Depending on how ’stale’ the read was - the data loss (due to truncation of the
list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded
data loss - not even in the case of a Disaster.

Dynamo implementations protect against this scenario by using vector clocks. If we define a "stale read" as one which returns the key (or absence thereof) and an older vector clock, then any writes which use this older/non-existent vector clock will generate a conflict and the server will store two versions of the same key. The application then has the opportunity to resolve this conflict on the next read. When used in conjunction with quorums for reads and writes, this approach proves to be exceedingly robust.
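To make the mechanism concrete, here is a minimal sketch in Erlang. It is not taken from any particular Dynamo implementation: the module name vclock_sketch, the map-based clock and the write/3 helper are all assumptions made for illustration.

```erlang
-module(vclock_sketch).
-export([fresh/0, increment/2, descends/2, write/3]).

%% A vector clock is modelled as a map from node name to update counter
%% (an assumption of this sketch; real systems also track timestamps, etc.).
fresh() ->
    #{}.

%% Record one more update coordinated by Node.
increment(Node, Clock) ->
    maps:update_with(Node, fun(Count) -> Count + 1 end, 1, Clock).

%% true if clock A has seen every update that clock B has seen.
descends(A, B) ->
    maps:fold(fun(Node, CountB, Acc) ->
                      Acc andalso maps:get(Node, A, 0) >= CountB
              end, true, B).

%% Store {Clock, Value} next to the versions already held for a key.
%% Stored versions that the incoming clock descends from are superseded;
%% every other stored version is kept as a sibling for the reader to resolve.
write(Clock, Value, Stored) ->
    Kept = [Version || {Old, _} = Version <- Stored, not descends(Clock, Old)],
    [{Clock, Value} | Kept].
```

In this sketch, a write that carries a fresh clock (because the stale read said the key was missing) does not descend from the stored clock, so the existing list survives as a sibling instead of being wiped out. A write that carries a clock derived from the stored one replaces it as expected.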

Dynamo starts by saying it’s eventually consistent - but then in Section 4.5. it claims
a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting
the number of reads (R) and number of writes (W) to be more than the total number of replicas
(N) (ie. R+W>N) - one gets consistent data back on reads. This is flat out misleading. On close
analysis one observes that there are no barriers to joining a quorum group (for a set of
keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are
admitted back to the quorum group without any resynchronization barrier. As a result, reading
from R copies is not sufficient to give up-to-date data.

One of the foundational assumptions in the Dynamo system is that you define as many replicas as necessary to achieve your desired level of reliability. As with any replication based system, if you lose all of your replicas, there is no meaningful recovery. However, if we assume that you will always have some number of replicas functional, and we introduce an appropriate quorum on operations, we can identify those nodes which return stale data and repair them appropriately. In other words, it's perfectly possible not to have a resync barrier on joining, yet still ensure consistency in the answers provided to the client.
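A rough sketch of that idea in Erlang follows. The module name quorum_sketch and the {Node, Version, Value} reply shape are assumptions, and versions are plain integers only to keep the example short; a real coordinator would compare vector clocks.

```erlang
-module(quorum_sketch).
-export([overlaps/3, newest/1, stale_nodes/1]).

%% Any R readers and any W writers chosen from N replicas are guaranteed
%% to share at least one node exactly when R + W > N.
overlaps(N, R, W) ->
    R + W > N.

%% Pick the reply with the highest version from a non-empty list of
%% {Node, Version, Value} replies.
newest(Replies) ->
    lists:foldl(fun({_, Version, _} = Reply, {_, Best, _} = Acc) ->
                        if Version > Best -> Reply;
                           true -> Acc
                        end
                end, hd(Replies), Replies).

%% Nodes that answered with something older than the newest reply.
%% These are the ones the coordinator would repair.
stale_nodes(Replies) ->
    {_, Best, _} = newest(Replies),
    [Node || {Node, Version, _} <- Replies, Version < Best].
```

With N=3, R=2 and W=2, a node that rejoined after missing updates can still be one of the two readers, but the other reader must have taken part in the last successful write, so the coordinator sees the newer version and knows which node to fix.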

It might be helpful to recall that there are three levels of repair: read-repairs, hinted handoffs and replica synchronization. Two of these three are done in near-real time, thus minimizing the actual drift between nodes. Read repair deals with stale data on a per key/operation basis; the coordinator for a request can identify nodes responding with stale data and update them accordingly, using responses from other less stale nodes. Hinted handoffs are a bulk operation that is done when a node rejoins the cluster -- the keys updated while the node was down are replayed (in essence) to the rejoining node. Replica sync is something that is typically done once a day and does require a traversal of all the data for a given partition. Tricks like Merkle trees, however, permit only the changed portion of the data to be exchanged, so in practice it's not nearly as expensive as one might imagine in the abstract.
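The Merkle tree trick can be illustrated with a deliberately small, two-level hash tree. This is a sketch under assumptions: the module name merkle_sketch, the fixed segment count and the use of erlang:md5/1 are my choices for illustration, not details from the Dynamo paper.

```erlang
-module(merkle_sketch).
-export([build/2, diff/2]).

%% Build a two-level hash tree over a list of {Key, Value} pairs.
%% Keys are spread over Segments buckets, each bucket is hashed, and the
%% root is a hash over all bucket hashes. Returns {RootHash, BucketHashes}.
build(Pairs, Segments) ->
    Empty = maps:from_list([{S, []} || S <- lists:seq(0, Segments - 1)]),
    Buckets = lists:foldl(
                fun({Key, _} = Pair, Acc) ->
                        Segment = erlang:phash2(Key, Segments),
                        maps:update_with(Segment, fun(Ps) -> [Pair | Ps] end, Acc)
                end, Empty, Pairs),
    Leaves = maps:map(fun(_, Ps) -> hash(lists:sort(Ps)) end, Buckets),
    {hash(lists:sort(maps:to_list(Leaves))), Leaves}.

%% Segments whose contents differ between two trees built with the same
%% segment count. Equal roots mean nothing has to be exchanged at all.
diff({Root, _}, {Root, _}) ->
    [];
diff({_, LeavesA}, {_, LeavesB}) ->
    [S || {S, Hash} <- lists:sort(maps:to_list(LeavesA)),
          maps:get(S, LeavesB) =/= Hash].

hash(Term) ->
    erlang:md5(term_to_binary(Term)).
```

Two replicas would each build a tree for a partition, compare roots, and then ship only the keys in the segments returned by diff/2. A production tree would have more levels so that the comparison itself stays cheap for large partitions.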

Lack of point in time consistency at the surviving replica (that is evident in this scenario)
is very problematic for most applications. In cases where one transaction (B) populates entites
that refer to entities populated in previous transactions (A), the effect of B being applied to
the remote replica without A being applied leads to inconsistencies that applications are
typically ill equipped to handle (and doing so would make most applications complicated).

The Dynamo paper makes it very clear that applications do require more logic to deal with these situations. Yes, it's more work for the application, but in practice, it's not that bad. It's also important to point out that data dependencies are handled differently in these key/value stores than they are in a typical ACID environment. Usually apps will store the data in a denormalized form, so dependencies amongst key versions are minimal (if they exist at all). This makes it much easier to deal with conflicts as all the relevant data is on hand during the resolution phase.
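As a hedged illustration of what that application logic can look like, suppose each value is a complete, denormalized list of items (a shopping cart, say). The module name resolve_sketch and the union policy are assumptions for this example; the right merge rule always depends on the application.

```erlang
-module(resolve_sketch).
-export([merge_siblings/1]).

%% Each sibling is a full copy of the list, so the conflict can be resolved
%% from the siblings alone, without fetching any other key.
%% The policy here is a plain union with duplicates removed.
merge_siblings(Siblings) ->
    lists:usort(lists:append(Siblings)).
```

For example, merge_siblings([[milk, apple], [pear, milk]]) yields [apple, milk, pear]. One known cost of a union policy is that an item deleted in one sibling can reappear after the merge, which an application has to decide whether it can tolerate.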

I'll leave it to someone else to do a more exhaustive analysis of Mr. Sarma's arguments. It's been my experience over the past 2 years that Dynamo is one of those systems that you really have to see in action (or implement it) to appreciate the wonderful elegance and resiliency of the design. It's certainly not a one-size-fits-all solution, but works very well in the appropriate problem space.

Filed under: General 1 Comment
19 Aug 2009

Getting started with erlbox

Posted by dizzyd

erlbox is a set of Rake tasks that make it easy to build Erlang applications and embedded nodes. It's a framework that Phil and I developed over the past few months and is now something I'd prefer not to live without. While it would be nice to have a "pure" Erlang solution for doing builds, Rake has turned out to be an excellent tool and a reasonable, pragmatic solution to the problem.

Please note that erlbox (and this blog entry) isn't necessarily where you want to start when you're first learning Erlang -- see any of the excellent books for a good introduction to Erlang.

To get started with erlbox, you first need to install Ruby and RubyGems. Once you have RubyGems installed, you can then do:

$ gem install erlbox

This should pull down erlbox as well as Rake and any other dependencies. Note that the Debian version of RubyGems is a little weird -- my experience is that using RubyGems from source on Debian yields the best results.

Once you have erlbox installed, you're ready to put together an Erlang application. In keeping with OTP guidelines, we'll start by creating a standard OTP directory structure:

$ mkdir -p testapp/ebin testapp/src

The next step is to create an application descriptor (ebin/testapp.app) so that Erlang/OTP knows how to start our application up. Drop the following text into testapp/ebin/testapp.app:

{application, testapp,
 [{description, "Test Application"},
  {vsn, "1"},
  {modules, [ testapp,
              testapp_sup ]},
  {registered, []},
  {applications, [kernel,
                  stdlib]},
  {mod, {testapp, []}},
  {env, [
        ]}
 ]}.

The hows/whys of OTP .app files are beyond the scope of this entry -- see the Erlang docs, specifically "Working with OTP/Design Principles" for more details.

With the .app file and the basic directory structure in place, we are now ready to create the Rakefile that will be used by Rake to kick off the erlbox tasks. For this simple build, it's a one-liner:

$ echo "require 'erlbox'" > testapp/Rakefile

We can now use Rake to build our Erlang app:

$ cd testapp; rake
(in /home/dizzyd/src/testapp)
validating testapp.app...
rake aborted!
One or more modules listed in testapp.app do not exist as .beam:
 * testapp
 * testapp_sup

(See full trace by running task with --trace)

As you can see, we still have some work to do. One of the important things that erlbox does is validate the application descriptor ebin/testapp.app and ensure that all the modules it lists are present in compiled form in the ebin/ directory. In this case, the .app file claimed that a module named "testapp" would be present as ebin/testapp.beam, and erlbox generated an error when the module was not found.
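For readers curious about what such a check involves, here is a minimal sketch of the same idea written in Erlang. It is not erlbox's code (erlbox is a set of Rake tasks); the module name appcheck_sketch and the function missing_beams/1 are hypothetical.

```erlang
-module(appcheck_sketch).
-export([missing_beams/1]).

%% Read an application descriptor such as ebin/testapp.app and return the
%% modules it lists that have no compiled .beam file in the same directory.
missing_beams(AppFile) ->
    {ok, [{application, _Name, Props}]} = file:consult(AppFile),
    Modules = proplists:get_value(modules, Props, []),
    EbinDir = filename:dirname(AppFile),
    [Module || Module <- Modules,
               not filelib:is_regular(
                     filename:join(EbinDir, atom_to_list(Module) ++ ".beam"))].
```

Run against the descriptor above before any source has been compiled, this would return both testapp and testapp_sup, which matches the two modules named in the error.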

So, let's create the source for the testapp module. Drop the following text into testapp/src/testapp.erl:

-module(testapp).

-behaviour(application).

%% Application callbacks
-export([start/2, stop/1]).

%% ===============================================================
%% Application callbacks
%% ===============================================================

start(_StartType, _StartArgs) ->
    testapp_sup:start_link().

stop(_State) ->
    ok.

We also need to define the application supervisor in testapp/src/testapp_sup.erl:

-module(testapp_sup).

-behaviour(supervisor).

%% API
-export([start_link/0]).

%% Supervisor callbacks
-export([init/1]).

%% ===================================================================
%% API functions
%% ===================================================================

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% ===================================================================
%% Supervisor callbacks
%% ===================================================================

init([]) ->
    {ok,{{one_for_one,5,10}, []}}.

The erlbox tasks know to look for all source files in testapp/src/*.erl and compile them to testapp/ebin/*.beam. With these files in place let's try rake again:

$ cd testapp; rake
(in /home/dizzyd/src/testapp)
compiling src/testapp.erl...
compiling src/testapp_sup.erl...
validating testapp.app...
$

The build completed cleanly. Congratulations, you've just built a basic Erlang/OTP application using erlbox.

erlbox has many more features than what I've covered here. It can also build OTP embedded nodes, SNMP MIBs, port drivers and other everyday components that make up the OTP platform.

Filed under: Erlang 1 Comment
20 Dec 2008

Drained

Posted by dizzyd

The last portion of this year has been draining on many fronts. It's not a complaint -- just a statement. I was grading two classes, taking another and working full time. It was too much. I worked through everything, but at different times had to neglect things that I would have preferred to focus on.

I am slowly recovering. Now in this slow, happy time of Christmas and New Year's I find myself without my normal drive. I feel empty and light; it's disturbing after the harried pace of the past 4 months. There are so many side projects that I want to work on, but simply can't find the energy or desire to focus on them.

Over the years I've come to realize that the creative expenditure of creating software comes at a price. I have a fixed capacity for creating software -- if I expend that capacity it requires time to refuel. In the interim, I can still create but at a much diminished pace, and typically with a much lower quality than what I am accustomed to. The best thing to do, typically, is NOT create. Wait, pause and be patient. Permit focus to drift until it's ready to snap back to laser precision for the next Push.

This post probably sounds like nonsense -- perhaps it is.

Filed under: General No Comments
13 May 2008

GTD and clarity

Posted by dizzyd

I've recently been bitten by the GTD (http://en.wikipedia.org/wiki/Getting_Things_Done) bug. I'm not exactly a disorganized person -- I generally do get stuff done. What attracted me to the system is the core idea of striving for clarity of thought by eliminating (brain) clutter.

I've always loved the feeling that I get when I lose myself to a particularly challenging or fun piece of coding. It's that state of mind where you lose track of the passage of time and focus all your energies on turning ephemeral ideas into billions of electronic pulses. There is a clarity of thought in that state, and I would love to experience it more often.

The problem is, there is always clutter and noise. So, the logical question is, how does one eliminate these things and encourage a more constant state of clarity?

For myself, I've found that GTD is at least a starting point. It provides a framework on which to capture actions and ideas in a way that shunts the responsibility for tracking stuff from my brain to a more reliable store. As I've been consistently doing this for the past week, my list of actions/projects has grown far more rapidly than I would have ever thought. The amount of stuff that we juggle in our heads is truly prodigious -- no wonder the average attention span in our society is under 3 minutes.

Filed under: General No Comments
14 Mar 2008

Breathe

Posted by dizzyd

Digits click,

Neuron to circuit, ideas flow;

Software breathes.

Filed under: General No Comments