Gradual Epiphany

20Feb/10

Using Innostore with Riak

Posted by dizzyd

Innostore is an Erlang application that provides an API for storing and retrieving key/value data using the InnoDB storage system. This storage system is the same one used by MySQL for reliable, transactional data storage. It's a proven, fast system and perfect for use with Riak if you have a large amount of data to store. Let's take a look at how you can use Innostore as a backend for Riak.

(Note: I assume that you have successfully built an instance of Riak for your platform. If you built Riak from source in ~/riak, then set $RIAK to ~/riak/rel/riak.)

We get started by grabbing a stable release of Innostore. Download the source for a release from: http://bitbucket.org/basho/innostore/downloads/

Looking in the "Tags & snapshots" section, you should download the source for the highest available RELEASE_* tag. In my case, RELEASE_4 is the most recent release, so I'll grab the bz2 file associated with it:

http://bitbucket.org/basho/innostore/get/RELEASE_4.tar.bz2

Once I have the source code, it's time to unpack it and build:

$ tar -xjf innostore-RELEASE_4.tar.bz2
$ cd innostore
$ make

Depending on the speed of the machine you are building on, this may take a few minutes to complete. At the end, you should see a series of unit tests run, with the output ending:

=======================================================
All 7 tests passed.
100222 7:43:58 InnoDB: Shutdown completed; log sequence number 90283
Cover analysis: /Users/dizzyd/src/public/innostore/.eunit/index.html

Now that we have successfully built innostore, it's time to install it into the Riak distribution:

$ ./rebar install target=$RIAK/lib

If you look in the $RIAK/lib directory now, you should see the innostore-4 directory alongside a bunch of .ez files and other directories which compose the Riak release.

Now, we need to tell Riak to use the innostore driver as a backend. Make sure Riak is not running. Edit $RIAK/etc/app.config, setting the value for "storage_backend" as follows:

{storage_backend, innostore_riak},

In addition, append the configuration for the Innostore application after the SASL section:

{sasl, [ ....
       ]}, %% <-- make sure you add a comma here!!

{innostore, [
    {data_home_dir, "data/innodb"},      %% Where data files go
    {log_group_home_dir, "data/innodb"}, %% Where log files go
    {buffer_pool_size, 2147483648}       %% 2G in-memory buffer in bytes
]}

You may need to adjust the directories for data_home_dir and log_group_home_dir to match where you want the Inno data and log files to be stored. If possible, make sure that the data and log dirs are on separate disks -- this can yield much better performance.
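The buffer_pool_size value is given in bytes, which is easy to get wrong by a factor of 1024. Here is a small sketch of the arithmetic in Python, assuming binary units (1G = 1024^3 bytes). The helper name gib_to_bytes is made up for illustration and is not part of Innostore or Riak:

```python
def gib_to_bytes(gib):
    """Convert gibibytes to bytes (1 GiB = 1024**3 bytes)."""
    return gib * 1024 ** 3

# The 2G buffer pool used in the config above.
print(gib_to_bytes(2))  # 2147483648
```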

Once you've completed the changes to $RIAK/etc/app.config, you're ready to start Riak:

$ $RIAK/bin/riak console

As it starts up, you should see messages from Inno that end with something like:

100220 16:36:58 InnoDB: highest supported file format is Barracuda.
100220 16:36:58 Embedded InnoDB 1.0.3.5325 started; log sequence number 45764

That's it! You're ready to start using Riak for storing truly massive amounts of data.

Filed under: General No Comments
18Feb/10

1.44 am

Posted by dizzyd

It's 1.44 am. Woke up feeling weird; then my mind went running, afraid of what it might find.

I was diagnosed with follicular lymphoma three weeks ago now.

I'm blessed in a lot of ways. The cancer is slow moving, non-aggressive -- or so it appears at this point. I might not even require treatment in the near future. Even if I do require treatment, survival rates have jumped from 60% to 90% in the past five years -- the treatment for this cancer is progressing quickly. My company, Basho, has been wonderful to me in terms of helping me sort out a variety of insurance issues and arranging access to very good doctors.

All of these things are probably the reason I've not had any trouble sleeping until tonight.

It's still scary though. Cancer -- just the word inspires fear when you first hear it. You are struck, relatively quickly, with the fragility and preciousness of life. You suddenly have a deep desire to grow old. The prospect of death is a powerful incentive to live.

I cried more the first few days and weeks than I ever have in my 32 years. I cried because I was scared. I cried because I was worried about my wife, our 2-year-old and the new baby on the way. I cried because it felt unfair, unwarranted! I cried because I realized that there were some areas of my life that I had wasted -- and I wondered if I would have the chance to rectify them.

As I've gotten further into this process, emotions have settled out a bit. I realize now just how good I have it with this cancer. What I'm facing is absolutely nothing compared to other people I know with chronic medical conditions. It's a smudge on the screen; a minor distraction. There might be some tough times ahead, but my overall probability for immediate mortality is relatively stable and low.

That said, I'm determined to make the most of this challenge. If I must go through this valley, I'm going to extract every bit of growth from it that I can. I choose to grow, to push my boundaries in every dimension: physically, spiritually, mentally, emotionally. I choose to spend more time with my family and less time wandering the mental spaces of coding. I choose to listen more and speak less. I choose to be grateful that all of these realizations have been granted to me at 32 instead of 64.

It's now 2.21 am. I think it was just the Chinese food from dinner that woke me up.

Filed under: General 6 Comments
10Jan/10

Rebar

Posted by dizzyd

Over the past two months, I've been busy taking the lessons learned from erlbox and designing a pure Erlang build tool called rebar. While erlbox is a very complete toolkit of rake functions for building Erlang code, it has a couple of significant problems. First off, the external dependency on rake is often a significant problem for developers who are not conversant in Ruby. While anyone can learn Ruby, if you're an Erlang developer you likely have other tasks to attend to than learning a language solely for the purpose of maintaining your build system. The other significant problem with erlbox is that it spends a lot of time going in/out of Erlang to do "Erlangy" sorts of checks -- like parsing/validating the .app file, running eunit, etc. This leads to erlbox being a relatively slow build system, not to mention a little awkward to maintain since it is an odd mix of Ruby and invocations of Erlang.

Thus, rebar was born. As a strictly Erlang implementation, it's possible for Erlang developers to dig into it and improve/modify it with minimal effort. It's also wickedly fast, since it starts the VM up only once and has direct access to all the tools one needs to build and validate Erlang code. It can also exploit Erlang's inherent parallelism, so where possible it runs commands concurrently. Finally, it's designed to be a self-contained escript, so using rebar doesn't introduce any build dependencies other than a stock Erlang install. You simply drop the rebar script into your code tree and go!

You can see a demonstration of converting an existing app to rebar here.

Create and compile a simple OTP application by doing the following steps on a terminal:

$ mkdir myapp; cd myapp
$ wget http://bitbucket.org/basho/rebar/downloads/rebar; chmod u+x rebar
$ ./rebar create-app appid=myapp
$ ./rebar compile

Documentation is still scarce -- that's something I'm going to be working on over the next few weeks. The core pieces of rebar are mostly at a point that I'm happy with; now it's time to polish. :)

If you have questions about rebar, or especially feedback after using it IRL, please ping me on Freenode IRC -- I'm typically in the #riak room.

Filed under: General 3 Comments
3Nov/09

Further thoughts on Dynamo’s “flawed architecture”

Posted by dizzyd

Mr. Sarma revisits his claims that Dynamo is a universally "flawed architecture". I certainly concur that Dynamo has its flaws, but a sweeping claim that it is universally flawed undervalues what Dynamo contributes to production thinking. So, once again, I'm going to take a few choice quotes from Mr. Sarma and respond to them.

However, i remain convinced that one should not force clients to deal with stale reads in
environments where they can be avoided. As i have mentioned in the updated initial post - there
are simple examples where stale reads cause havoc. One may not be able to do conflict
resolution or the reads can affect other keys in ways that are hard to fix later.

Arguing that applications "may not be able to do conflict resolution" is nonsensical -- by definition, Dynamo requires that the application be cognizant of conflict resolution! This isn't an arbitrary decision to make clients aware of conflicts. It's a part of a measured approach to building a robust system. One may not agree with it, but to claim that Dynamo is universally flawed just because it does not conform with one's personal feeling about layering is disingenuous at best.

Please understand me, I make no claim that Dynamo is the end-all-be-all for data stores. It is a terrible, terrible choice for some problem spaces. However, if you want a low-latency, highly-robust key/value store it works quite well.

About Vector Clocks and multiple versions - it’s not a surprise that they were not
implemented in Cassandra. In Cassandra - the cost of having to retrieve many versions of a key
increases the disk seek costs reads multi-fold. Due to the usage of LSM trees, a disk seek may
be required for each file that has a version of the key. Even though the versions may not
require reconciliation, one still has to read them.

This is an argument about implementation details of Cassandra and has nothing to do with whether or not Dynamo is a universally flawed architecture. I can say from experience that vector clocks do not have to be slow -- as with anything, careful implementation can yield surprisingly fast results. I would also note that in the production systems where I've deployed Dynamo-clones, the actual occurrence of multiple versions (or conflicts, in Dynamo terms) is quite rare. The original Dynamo paper (sect 6.3, para 3) notes that 99.94% of all requests return a single version; this matches closely with what I've observed in my own production deployments today (99.91%).

Also, implementation-wise, one doesn't typically keep resolved versions lying around -- the only time there are multiple versions present on disk is when a conflict has not been resolved. One _could_ keep old versions around, I suppose, and in that situation I agree that you would want to carefully design your store so as to avoid unnecessary seeks when reading the "current" version.

So, unfortunately, i am repeating this yet again - Dynamo’s quorum consensus
protocol seems fundamentally broken. How can one write outside the quorum group and claim a
write quorum? And when one does so - how can one get consistent reads without reading every
freaking replica all the time? (well - the answer is - one doesn’t - which is why Dynamo is
eventually consistent. I just hope that users/developers of Dynamo clones realize this now).

As Mr. Sarma astutely points out, the reason Dynamo works is because it makes no guarantees about instantaneous consistency. Assuming (again) that the client can tolerate conflicts and that the cluster will attempt to resync at the earliest possible opportunity, writing to non-authoritative nodes is perfectly fine. The system will _eventually_ come back into consensus.

Unfortunately, I'm pretty sure that my arguments will be insufficient to convince Mr. Sarma of the utility of Dynamo. I hope, however, that anyone reading this discussion will consider that reviewing the concepts of a paper is a very different task from executing on those concepts. As someone who has successfully executed ideas from that paper, I can assure Mr. Sarma that the concepts not only work, but they work surprisingly well.

Finally, the real contribution of the Dynamo paper is the balance that was struck between performance, reliability and pragmatism in the design of a production DHT. It underscores the importance of taking nothing for granted and being willing to consider counterintuitive solutions to hard problems.

Filed under: General 1 Comment
1Nov/09

Thoughts on Dynamo’s “flawed architecture”

Posted by dizzyd

In general, I think it's a little inflammatory to make sweeping statements about the fitness of a given architecture. Every architecture has its flaws; it's an expected state when you are faced with diametrically opposed constraints. The real question that should be asked is whether or not an architecture solves the problems for which it was designed in a reliable and efficient manner.

Joydeep Sarma posted an entry claiming that Dynamo is a "flawed architecture". I'm not really qualified to prove or disprove Mr. Sarma's claim, but having implemented a Dynamo clone, I think that he may be a little confused about how things work in these systems. What follows are a few quotes from his write-up followed by my own responses.

Let’s say that one is storing key-value pairs in Dynamo - where the value encodes a ‘list’. If
Dynamo returns a stale read for a key and claims the key is missing, the application will
create a new empty list and store it back in Dynamo. This will cause the existing key to be
wiped out. Depending on how ’stale’ the read was - the data loss (due to truncation of the
list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded
data loss - not even in the case of a Disaster.

Dynamo implementations protect against this scenario by using vector clocks. If we define a "stale read" as one which returns the key (or absence thereof) and an older vector clock, then any writes which use this older/non-existent vector clock will generate a conflict and the server will store two versions of the same key. The application then has the opportunity to resolve this conflict on the next read. When used in conjunction with quorums for reads and writes, this approach proves to be exceedingly robust.
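To make the mechanism concrete, here is a minimal Python sketch of how a write made against a stale clock ends up stored as a second version instead of wiping out the list. It is an illustration under simplifying assumptions, not code from Dynamo or any particular clone: clocks are plain dicts of node name to counter, the names descends, increment and put are made up for this example, and details such as clock pruning are ignored.

```python
def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def increment(clock, node):
    """Return a copy of `clock` with the coordinating node's counter bumped."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def put(siblings, context, value, node):
    """Store `value`, written against `context` (the clock the client last read).

    `siblings` is the list of (clock, value) pairs currently stored for the key.
    Stored versions that the new clock does not descend from are kept, so a
    write based on a stale read creates a conflict rather than an overwrite.
    """
    new_clock = increment(context, node)
    kept = [(c, v) for c, v in siblings if not descends(new_clock, c)]
    return kept + [(new_clock, value)]

# The key currently holds a list, last written via (hypothetical) node "n1".
stored = [({"n1": 3}, ["a", "b", "c"])]

# Stale read: the client was told the key is missing, so its context is empty.
after_stale = put(stored, {}, [], "n2")
print(len(after_stale))  # 2

# Up-to-date read: the client passes back the clock it saw.
after_fresh = put(stored, {"n1": 3}, ["a", "b", "c", "d"], "n1")
print(len(after_fresh))  # 1
```

In the stale case the original list survives next to the new empty one, and the application sees both on its next read.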

Dynamo starts by saying it’s eventually consistent - but then in Section 4.5. it claims
a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting
the number of reads (R) and number of writes (W) to be more than the total number of replicas
(N) (ie. R+W>N) - one gets consistent data back on reads. This is flat out misleading. On close
analysis one observes that there are no barriers to joining a quorum group (for a set of
keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are
admitted back to the quorum group without any resynchronization barrier. As a result, reading
from R copies is not sufficient to give up-to-date data.

One of the foundational assumptions in the Dynamo system is that you define as many replicas as necessary to achieve your desired level of reliability. As with any replication based system, if you lose all of your replicas, there is no meaningful recovery. However, if we assume that you will always have some number of replicas functional, and we introduce an appropriate quorum on operations, we can identify those nodes which return stale data and repair them appropriately. In other words, it's perfectly possible not to have a resync barrier on joining, yet still ensure consistency in the answers provided to the client.
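The quorum arithmetic behind this can be checked by brute force. The sketch below is my own illustration in Python, assuming reads and writes both go to the same fixed set of N replicas (it does not model writes that land outside that set). It enumerates every possible read set and write set and checks whether they always share at least one replica; quorums_always_overlap is a made-up name.

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """True if every r-replica read set intersects every w-replica write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

print(quorums_always_overlap(3, 2, 2))  # True:  R + W > N
print(quorums_always_overlap(3, 1, 1))  # False: R + W <= N
```

When the sets always overlap, at least one of the R responses carries the newest write, and comparing vector clocks across the responses shows which replicas are behind.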

It might be helpful to recall that there are three levels of repair: read-repairs, hinted handoffs and replica synchronization. Two of these three are done in near-real time, thus minimizing the actual drift between nodes. Read repair deals with stale data on a per key/operation basis; the coordinator for a request can identify nodes responding with stale data and update them accordingly, using responses from other less stale nodes. Hinted handoffs are a bulk operation that is done when a node rejoins the cluster -- the keys updated while the node was down are replayed (in essence) to the rejoining node. Replica sync is something that is typically done once a day and does require a traversal of all the data for a given partition. Tricks like Merkle trees, however, permit only the changed portion of the data to be exchanged, so in practice it's not nearly as expensive as one might imagine in the abstract.
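As a rough illustration of why Merkle trees keep replica sync cheap, here is a small Python sketch. The layout is an assumption made for the example (keys hashed into a fixed number of buckets, SHA-256 for the node hashes, all names hypothetical) and is not how any specific Dynamo clone arranges its trees. Two replicas compare hashes from the root down and descend only where they differ, so only the buckets that actually changed need to be exchanged.

```python
import hashlib

BUCKETS = 8  # number of leaves; must be a power of two

def digest(text):
    return hashlib.sha256(text.encode()).hexdigest()

def bucket_of(key):
    """Assign a key to one of the leaf buckets."""
    return int(digest(key), 16) % BUCKETS

def build_tree(data):
    """Return a list of hash levels: leaves first, root last."""
    leaves = [[] for _ in range(BUCKETS)]
    for key in sorted(data):
        leaves[bucket_of(key)].append(key + "=" + data[key])
    level = [digest("|".join(items)) for items in leaves]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_buckets(a, b, depth=None, index=0):
    """Walk both trees from the root, descending only where hashes differ."""
    if depth is None:
        depth = len(a) - 1  # start at the root level
    if a[depth][index] == b[depth][index]:
        return []           # identical subtree: nothing to exchange
    if depth == 0:
        return [index]      # a leaf bucket that needs syncing
    return (differing_buckets(a, b, depth - 1, 2 * index)
            + differing_buckets(a, b, depth - 1, 2 * index + 1))

replica_a = {"user:%d" % i: "v1" for i in range(100)}
replica_b = dict(replica_a)
replica_b["user:42"] = "v2"  # one key has drifted

diff = differing_buckets(build_tree(replica_a), build_tree(replica_b))
print(diff == [bucket_of("user:42")])  # True
```

Building the tree still touches every key in the partition, which matches the full traversal mentioned above; the saving is in what has to cross the network afterwards.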

Lack of point in time consistency at the surviving replica (that is evident in this scenario)
is very problematic for most applications. In cases where one transaction (B) populates entites
that refer to entities populated in previous transactions (A), the effect of B being applied to
the remote replica without A being applied leads to inconsistencies that applications are
typically ill equipped to handle (and doing so would make most applications complicated).

The Dynamo paper makes it very clear that applications do require more logic to deal with these situations. Yes, it's more work for the application, but in practice, it's not that bad. It's also important to point out that data dependencies are handled differently in these key/value stores than they are in a typical ACID environment. Usually apps will store the data in a denormalized form, so dependencies amongst key versions are minimal (if they exist at all). This makes it much easier to deal with conflicts as all the relevant data is on hand during the resolution phase.
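For the list example quoted earlier, application-side resolution can be quite small. The following Python sketch merges conflicting versions of a list by taking their union. This is one possible policy chosen for illustration, not something prescribed by the Dynamo paper, and it assumes the application never needs to remove items (a plain union cannot tell a deletion from a missing write). The function name resolve is made up.

```python
def resolve(siblings):
    """Merge conflicting list values into one list, keeping first-seen order."""
    merged = []
    for value in siblings:
        for item in value:
            if item not in merged:
                merged.append(item)
    return merged

# A stale write stored an empty list next to the real one: nothing is lost.
print(resolve([["a", "b", "c"], []]))    # ['a', 'b', 'c']

# Two clients appended different items concurrently: both survive.
print(resolve([["a", "b"], ["a", "c"]]))  # ['a', 'b', 'c']
```

Because each value is self-contained, the merge needs nothing beyond the conflicting versions themselves.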

I'll leave it to someone else to do a more exhaustive analysis of Mr. Sarma's arguments. It's been my experience over the past 2 years that Dynamo is one of those systems that you really have to see in action (or implement) to appreciate the wonderful elegance and resiliency of the design. It's certainly not a one-size-fits-all solution, but it works very well in the appropriate problem space.

Filed under: General 1 Comment