Further thoughts on Dynamo’s “flawed architecture”
Mr. Sarma revisits his claims that Dynamo is a universally "flawed architecture". I certainly concur that Dynamo has its flaws, but making sweeping claims about something being universally so is to under-value the contribution to production thinking that Dynamo contributes. So, once again, I'm going to take a few choice quotes from Mr. Sarma and respond to them.
However, i remain convinced that one should not force clients to deal with stale reads in
environments where they can be avoided. As i have mentioned in the updated initial post - there
are simple examples where stale reads cause havoc. One may not be able to do conflict
resolution or the reads can affect other keys in ways that are hard to fix later.
Arguing applications "may not be able to do conflict resolution" is non-sensical -- by definition, Dynamo requires that the application be cognizant of conflict resolution! This isn't an arbitrary decision to make clients aware of conflicts. It's a part of a measured approach to building a robust system. One may not agree with it, but to claim that Dynamo is universally flawed just because it does not conform with one's personal feeling about layering is dis-ingenous at best.
Please understand me, I make no claim that Dynamo is the end-all-be-all for data stores. It is a terrible, terrible choice for some problem spaces. However, if you want a low-latency, highly-robust key/value store it works quite well.
About Vector Clocks and multiple versions - it’s not a surprise that they were not
implemented in Cassandra. In Cassandra - the cost of having to retrieve many versions of a key
increases the disk seek costs reads multi-fold. Due to the usage of LSM trees, a disk seek may
be required for each file that has a version of the key. Even though the versions may not
require reconciliation, one still has to read them.
This is an argument about implementation details of Cassandra and has nothing to do with whether or not Dynamo is a universally flawed architecture. I can say from experience that vector clocks do not have to be slow -- as with anything, careful implementation can yield surprisingly fast results. I would also note that in the production systems where I've deployed Dynamo-clones, the actual occurrence of multiple versions (or conflicts, in Dynamo terms) is quite rare. The original Dynamo paper (sect 6.3, para 3) notes that 99.94% of all requests return a single version; this matches closely with what I've observed in my own production deployments today (99.91%).
Also, implementation-wise, one doesn't typically keep resolved versions lying around -- the only time there are multiple versions present on disk is when a conflict has not been
resolved. One _could_ keep old versions around, I suppose, and in that situation I agree that you would want to carefully design your store so as to avoid unnecessary seeks when reading the "current" version.
So, unfortunately, i am repeating this yet again - Dynamo’s quorum consensus
protocol seems fundamentally broken. How can one write outside the quorum group and claim a
write quorum? And when one does so - how can one get consistent reads without reading every
freaking replica all the time? (well - the answer is - one doesn’t - which is why Dynamo is
eventually consistent. I just hope that users/developers of Dynamo clones realize this now).
As Mr. Sarma astutely points out, the reason Dynamo works is because it makes no guarantees about instantaneous consistency. Assuming (again) that the client can tolerate conflicts and that the cluster will attempt to resync at the earliest possible opportunity, writing to non-authoritative nodes is perfectly fine. The system will _eventually_ come back into consensus.
Unfortunately, I'm pretty sure that my arguments will be insufficient to convince Mr. Sarma of the utility of Dynamo. I hope, however, that anyone reading this discussion will consider that reviewing the concepts of a paper is a very different task from executing on those concepts. As someone who has successfully executed ideas from that paper, I can assure Mr. Sarma that the concepts not only work, but they work surprisingly well.
Finally, the real contribution of the Dynamo paper is the balance that was struck between performance, reliability and pragmatism in the design of a production DHT. It underscores the importance of taking nothing for granted and being willing to consider counter-intutitive solutions to hard problems.
Thoughts on Dynamo’s “flawed architecture”
In general, I think it's a little inflammatory to make sweeping statements about the fitness of a given architecture. Every architecture has its flaws; it's an expected state when you are faced with diametrically opposing constraints. The real question that should be asked is whether or not an architecture solves the problems for which it was designed in a reliable and efficient manner.
Joydeep Sarma posted an entry claiming that Dynamo is a "flawed architecture". I'm not really qualified to prove or disprove Mr. Sarma's claim, but having implemented a Dynamo clone, I think that he may be a little confused about how things work in these systems. What follows are a few quotes from his write-up followed by my own responses.
Let’s say that one is storing key-value pairs in Dynamo - where the value encodes a ‘list’. If
Dynamo returns a stale read for a key and claims the key is missing, the application will
create a new empty list and store it back in Dynamo. This will cause the existing key to be
wiped out. Depending on how ’stale’ the read was - the data loss (due to truncation of the
list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded
data loss - not even in the case of a Disaster.
Dynamo implementations protect against this scenario by using vector clocks. If we define a "stale read" as one which returns the key (or absence thereof) and an older vector clock, then any writes which use this older/non-existent vector clock will generate a conflict and the server will store two versions of the same key. The application then has the opportunity to resolve this conflict on the next read. When used in conjuction with quoroms for reads and writes, this approach proves to be exceedingly robust.
Dynamo starts by saying it’s eventually consistent - but then in Section 4.5. it claims
a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting
the number of reads (R) and number of writes (W) to be more than the total number of replicas
(N) (ie. R+W>N) - one gets consistent data back on reads. This is flat out misleading. On close
analysis one observes that there are no barriers to joining a quorum group (for a set of
keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are
admitted back to the quorum group without any resynchronization barrier. As a result, reading
from R copies is not sufficient to give up-to-date data.
One of the foundational assumptions in the Dynamo system is that you define as many replicas as necessary to achieve your desired level of reliability. As with any replication based system, if you lose all of your replicas, there is no meaningful recovery. However, if we assume that you will always have some number of replicas functional, and we introduce an appropriate quorum on operations, we can identify those nodes which return stale data and repair them appropriately. In other words, it's perfectly possible not to have resync barrier on joining, yet still ensure consistency in the answers provided to the client.
It might be helpful to recall that there are three levels of repair: read-repairs, hinted handoffs and replica synchronization. Two of these three are done in near-real time, thus minimizing the actual drift between nodes. Read repair deals with stale data on a per key/operation basis; the coordinator for a request can identify nodes responding with stale data and update them accordingly, using responses from other less stale nodes. Hinted handoffs are a bulk operation that is done when a node rejoins the cluster -- the keys updated while the node was down are replayed (in essence) to the rejoining node. Replica sync is something that is typically done once a day and does require a traversal of all the data for a given partition. Tricks like Merkel trees, however, permit only the changed portion of the data to be exchanged, so in practice it's not nearly as expensive as one might imagine in the abstract.
Lack of point in time consistency at the surviving replica (that is evident in this scenario)
is very problematic for most applications. In cases where one transaction (B) populates entites
that refer to entities populated in previous transactions (A), the effect of B being applied to
the remote replica without A being applied leads to inconsistencies that applications are
typically ill equipped to handle (and doing so would make most applications complicated).
The Dynamo paper makes it very clear that applications do require more logic to deal with these situations. Yes, it's more work for the application, but in practice, it's not that bad. It's also important to point out that data dependencies are handled differently in these key/value stores than they are in a typical ACID environment. Usually apps will store the data in a denormalized form, so dependencies amongst key versions are minimal (if they exist at all). This makes it much easier to deal with conflicts as all the relevant data is on hand during the resolution phase.
I'll leave it to someone else to do a more exhaustive analysis of Mr. Samra's arguments. It's been my experience over the past 2 years that Dynamo is one of those systems that you really have to see in action (or implement it) to appreciate the wonderful elegance and resiliency of the design. It's certainly not a one-size-fits-all solution, but works very well in the appropriate problem space.