Using Innostore with Riak
Innostore is an Erlang application that provides an API for storing and retrieving key/value data using the InnoDB storage system. This storage system is the same one used by MySQL for reliable, transactional data storage. It's a proven, fast system and perfect for use with Riak if you have a large amount of data to store. Let's take a look at how you can use Innostore as a backend for Riak.
(Note: I assume that you have successfully built an instance of Riak for your platform. If you built Riak from source in ~/riak, then set $RIAK to ~/riak/rel/riak.)
First, grab a stable release of Innostore. Download the source for a release from: http://bitbucket.org/basho/innostore/downloads/
Looking in the "Tags & snapshots" section, you should download the source for the highest available RELEASE_* tag. In my case, RELEASE_4 is the most recent release, so I'll grab the bz2 file associated with it:
http://bitbucket.org/basho/innostore/get/RELEASE_4.tar.bz2
Once I have the source code, it's time to unpack it and build:
$ tar -xjf innostore-RELEASE_4.tar.bz2
$ cd innostore
$ make
Depending on the speed of the machine you are building on, this may take a few minutes to complete. At the end, you should see a series of unit tests run, with the output ending:
=======================================================
All 7 tests passed.
100222 7:43:58 InnoDB: Shutdown completed; log sequence number 90283
Cover analysis: /Users/dizzyd/src/public/innostore/.eunit/index.html
Now that we have successfully built innostore, it's time to install it into the Riak distribution:
$ ./rebar install target=$RIAK/lib
If you look in the $RIAK/lib directory now, you should see the innostore-4 directory alongside a bunch of .ez files and other directories which compose the Riak release.
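If you would rather script that check, here is a minimal sketch in Python (used here only for illustration). The helper name find_innostore is hypothetical and not part of Riak or Innostore. It assumes only that the installed directory name starts with "innostore-", as described above.

```python
import os

def find_innostore(lib_dir):
    """Return the names of innostore-* directories found in lib_dir, sorted."""
    return sorted(
        name for name in os.listdir(lib_dir)
        if name.startswith("innostore-")
        and os.path.isdir(os.path.join(lib_dir, name))
    )

if __name__ == "__main__":
    # Assumes $RIAK is set as described in the note at the top of this post.
    riak = os.environ.get("RIAK")
    if riak and os.path.isdir(os.path.join(riak, "lib")):
        print(find_innostore(os.path.join(riak, "lib")))
```

An empty list means the install step did not put anything into $RIAK/lib.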
Now, we need to tell Riak to use the innostore driver as a backend. Make sure Riak is not running. Edit $RIAK/etc/app.config, setting the value for "storage_backend" as follows:
{storage_backend, innostore_riak},
In addition, append the configuration for the Innostore application after the SASL section:
{sasl, [ ....
]}, %% <-- make sure you add a comma here!!
{innostore, [
    {data_home_dir, "data/innodb"}, %% Where data files go
    {log_group_home_dir, "data/innodb"}, %% Where log files go
    {buffer_pool_size, 2147483648} %% 2G in-memory buffer in bytes
]}
You may need to adjust the data_home_dir and log_group_home_dir settings to match where you want the InnoDB data and log files to be stored. If possible, make sure that the data and log dirs are on separate disks -- this can yield much better performance.
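The buffer_pool_size value is given in bytes, so it is easy to mistype when you change it. A small sketch of the arithmetic, in Python for illustration; the helper name gib_to_bytes is hypothetical and the 0.5 GiB figure is only an example value, not a recommendation.

```python
def gib_to_bytes(gib):
    """Convert a size in GiB (2**30 bytes) to a whole number of bytes."""
    return int(gib * 1024 ** 3)

# The 2G buffer pool used in the configuration above.
print(gib_to_bytes(2))    # 2147483648
# A smaller pool, e.g. for a development machine (illustrative value only).
print(gib_to_bytes(0.5))  # 536870912
```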
Once you've completed the changes to $RIAK/etc/app.config, you're ready to start Riak:
$ $RIAK/bin/riak console
As it starts up, you should see messages from Inno that end with something like:
100220 16:36:58 InnoDB: highest supported file format is Barracuda.
100220 16:36:58 Embedded InnoDB 1.0.3.5325 started; log sequence number 45764
That's it! You're ready to start using Riak for storing truly massive amounts of data.
1.44 am
It's 1.44 am. Woke up feeling weird; then my mind went running, afraid of what it might find.
I was diagnosed with follicular lymphoma three weeks ago now.
I'm blessed in a lot of ways. The cancer is slow moving, non aggressive -- or so it appears at this point. I might not even require treatment in the near future. Even if I do require treatment, survival rates have jumped from 60% to 90% in the past five years -- the treatment for this cancer is progressing quickly. My company, Basho, has been wonderful to me in terms of helping me sort out a variety of insurance issues and arranging access to very good doctors.
All of these things are probably the reason I've not had any trouble sleeping until tonight.
It's still scary though. Cancer -- just the word inspires fear when you first hear it. You are struck, relatively quickly, with the fragility and preciousness of life. You suddenly have a deep desire to grow old. The prospect of death is a powerful incentive to live.
I cried more the first few days and weeks than I ever have in my 32 years. I cried because I was scared. I cried because I was worried about my wife, our 2 year old and the new baby on the way. I cried because it felt unfair, unwarranted! I cried because I realized that there were some areas of my life that I had wasted -- and I wondered if I would have the chance to rectify them.
As I've gotten further into this process, emotions have settled out a bit. I realize now just how good I have it with this cancer. What I'm facing is absolutely nothing compared to other people I know with chronic medical conditions. It's a smudge on the screen; a minor distraction. There might be some tough times ahead, but my overall probability for immediate mortality is relatively stable and low.
That said, I'm determined to make the most of this challenge. If I must go through this valley, I'm going to extract every bit of growth from it that I can. I choose to grow, to push my boundaries in every dimension: physically, spiritually, mentally, emotionally. I choose to spend more time with my family and less time wandering the mental spaces of coding. I choose to listen more and speak less. I choose to be grateful that all of these realizations have been granted to me at 32 instead of 64.
It's now 2.21 am. I think it was just the Chinese food from dinner that woke me up.
Rebar
Over the past two months, I've been busy taking the lessons learned from erlbox and designing a pure Erlang build tool called rebar. While erlbox is a very complete toolkit of rake functions for building Erlang code, it has a couple of significant problems. First off, the external dependency on rake is a hurdle for developers who are not conversant in Ruby. While anyone can learn Ruby, if you're an Erlang developer you likely have other tasks to attend to than learning a language solely for the purpose of maintaining your build system. The other problem with erlbox is that it spends a lot of time going in and out of Erlang to do "Erlangy" sorts of checks -- like parsing/validating the .app file, running eunit, etc. This makes erlbox a relatively slow build system, not to mention a little awkward to maintain, since it is an odd mix of Ruby and invocations of Erlang.
Thus, rebar was born. As a strictly Erlang implementation, it's possible for Erlang developers to dig into it and improve/modify with minimal effort. It's also wickedly fast, since it starts the VM up only once and has direct access to all the tools one needs to build and validate Erlang code. It can also exploit Erlang's inherent parallelism, so where possible, it runs commands concurrently. Finally, it's designed to be a self-contained escript, so using rebar doesn't introduce any build dependencies other than a stock Erlang install. You simply drop the rebar script into your code tree and go!
You can see a demonstration of converting an existing app to rebar here.
Create and compile a simple OTP application by doing the following steps on a terminal:
$ mkdir myapp; cd myapp
$ wget http://bitbucket.org/basho/rebar/downloads/rebar; chmod u+x rebar
$ ./rebar create-app appid=myapp
$ ./rebar compile
Documentation is still scarce -- that's something I'm going to be working on over the next few weeks. The core pieces of rebar are mostly at a point that I'm happy with; now it's time to polish.
If you have questions about rebar, or especially feedback after using it IRL, please ping me on Freenode IRC -- I'm typically in the #riak room.
Running erl in a debugger
Let's say you need to debug a port driver in Erlang. This typically involves gdb (unless you prefer the printf route). Go to where Erlang is installed and edit the bin/erl script. Change the last line from:
exec $BINDIR/erlexec ${1+"$@"}
to:
if [ ! -z "$USE_GDB" ]; then
    gdb $BINDIR/erlexec --args $BINDIR/erlexec ${1+"$@"}
else
    exec $BINDIR/erlexec ${1+"$@"}
fi
Now all you have to do to get Erlang running in gdb is:
$ export USE_GDB=1
$ erl
If all goes well, you should see something like:
(dizzyd@sigr).(~)% export USE_GDB=1
(dizzyd@sigr).(~)% erl
GNU gdb 6.3.50-20050815 (Apple version gdb-1344) (Fri Jul 3 01:19:56 UTC 2009)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done
(gdb) r
Starting program: /Users/dizzyd/Applications/erlang-r13b03/lib/erlang/erts-5.7.4/bin/erlexec
Reading symbols for shared libraries +. done
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00007fff5fc01028 in __dyld__dyld_start ()
(gdb) c
Continuing.
Reading symbols for shared libraries ... done
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [kernel-poll:false]
Eshell V5.7.4 (abort with ^G)
1>
Further thoughts on Dynamo’s “flawed architecture”
Mr. Sarma revisits his claims that Dynamo is a universally "flawed architecture". I certainly concur that Dynamo has its flaws, but making sweeping claims about something being universally so undervalues the contribution Dynamo makes to production thinking. So, once again, I'm going to take a few choice quotes from Mr. Sarma and respond to them.
However, i remain convinced that one should not force clients to deal with stale reads in
environments where they can be avoided. As i have mentioned in the updated initial post - there
are simple examples where stale reads cause havoc. One may not be able to do conflict
resolution or the reads can affect other keys in ways that are hard to fix later.
Arguing that applications "may not be able to do conflict resolution" is nonsensical -- by definition, Dynamo requires that the application be cognizant of conflict resolution! This isn't an arbitrary decision to make clients aware of conflicts. It's a part of a measured approach to building a robust system. One may not agree with it, but to claim that Dynamo is universally flawed just because it does not conform to one's personal feeling about layering is disingenuous at best.
Please understand me, I make no claim that Dynamo is the end-all-be-all for data stores. It is a terrible, terrible choice for some problem spaces. However, if you want a low-latency, highly-robust key/value store it works quite well.
About Vector Clocks and multiple versions - it’s not a surprise that they were not
implemented in Cassandra. In Cassandra - the cost of having to retrieve many versions of a key
increases the disk seek costs reads multi-fold. Due to the usage of LSM trees, a disk seek may
be required for each file that has a version of the key. Even though the versions may not
require reconciliation, one still has to read them.
This is an argument about implementation details of Cassandra and has nothing to do with whether or not Dynamo is a universally flawed architecture. I can say from experience that vector clocks do not have to be slow -- as with anything, careful implementation can yield surprisingly fast results. I would also note that in the production systems where I've deployed Dynamo-clones, the actual occurrence of multiple versions (or conflicts, in Dynamo terms) is quite rare. The original Dynamo paper (sect 6.3, para 3) notes that 99.94% of all requests return a single version; this matches closely with what I've observed in my own production deployments today (99.91%).
Also, implementation-wise, one doesn't typically keep resolved versions lying around -- the only time there are multiple versions present on disk is when a conflict has not been resolved. One _could_ keep old versions around, I suppose, and in that situation I agree that you would want to carefully design your store so as to avoid unnecessary seeks when reading the "current" version.
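To make the vector clock discussion concrete, here is a minimal sketch in Python of how two versions can be classified as ordered or conflicting. This is an illustration only: it is not the code from Dynamo, Cassandra, or any particular Dynamo clone, and the function and node names are hypothetical. Clocks are modeled as plain dicts mapping a node name to a counter.

```python
def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Classify two vector clocks (dicts mapping node name -> counter)."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict"

def increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

v1 = increment({}, "node_a")   # first write, coordinated by node_a
v2 = increment(v1, "node_b")   # later update of v1 via node_b
v3 = increment(v1, "node_c")   # concurrent update of v1 via node_c

print(compare(v2, v1))  # a supersedes b
print(compare(v2, v3))  # conflict
```

Only the "conflict" case leaves more than one version to store and hand back to the client. In the other cases the older version can simply be discarded.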
So, unfortunately, i am repeating this yet again - Dynamo’s quorum consensus
protocol seems fundamentally broken. How can one write outside the quorum group and claim a
write quorum? And when one does so - how can one get consistent reads without reading every
freaking replica all the time? (well - the answer is - one doesn’t - which is why Dynamo is
eventually consistent. I just hope that users/developers of Dynamo clones realize this now).
As Mr. Sarma astutely points out, Dynamo works because it makes no guarantees about instantaneous consistency. Assuming (again) that the client can tolerate conflicts and that the cluster will attempt to resync at the earliest possible opportunity, writing to non-authoritative nodes is perfectly fine. The system will _eventually_ come back into consensus.
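The exchange above can be illustrated with a toy model, sketched here in Python. Everything in it is an assumption made for illustration: replicas are plain dicts, versions are simple integers rather than vector clocks, and the names write, read, and handoff are hypothetical. It is not how Dynamo or any clone is actually implemented. It shows a write that meets its quorum using a fallback node, a read that comes back stale as a result, and the same read returning the new value once the fallback node hands its copy to a preferred replica.

```python
def quorums_overlap(n, r, w):
    """Strict quorums: every read set shares a replica with every write set iff r + w > n."""
    return r + w > n

def write(replicas, key, version, value):
    """Store (version, value) under key on each of the given replicas."""
    for store in replicas:
        store[key] = (version, value)

def read(replicas, key):
    """Return the highest-versioned (version, value) among the given replicas, or None."""
    seen = [store[key] for store in replicas if key in store]
    return max(seen, key=lambda item: item[0]) if seen else None

def handoff(source, target, key):
    """Copy key from source to target when source holds a newer version."""
    if key in source and (key not in target or source[key][0] > target[key][0]):
        target[key] = source[key]

# N=3 preferred replicas (a, b, c) plus one fallback node (d) outside that group.
a, b, c, d = {}, {}, {}, {}

write([a, b, c], "k", 1, "old")   # normal write reaches all preferred replicas
write([a, d], "k", 2, "new")      # b and c unreachable: W=2 is met using fallback d
stale = read([b, c], "k")         # an R=2 read that happens to miss a and d
handoff(d, b, "k")                # d later hands its copy to b
fresh = read([b, c], "k")

print(quorums_overlap(3, 2, 2))   # True
print(stale)                      # (1, 'old')
print(fresh)                      # (2, 'new')
```

Note that r + w > n holds here, yet the first read is still stale, because the write quorum included a node outside the preferred group. That gap is the one the resync step closes.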
Unfortunately, I'm pretty sure that my arguments will be insufficient to convince Mr. Sarma of the utility of Dynamo. I hope, however, that anyone reading this discussion will consider that reviewing the concepts of a paper is a very different task from executing on those concepts. As someone who has successfully executed ideas from that paper, I can assure Mr. Sarma that the concepts not only work, but they work surprisingly well.
Finally, the real contribution of the Dynamo paper is the balance that was struck between performance, reliability and pragmatism in the design of a production DHT. It underscores the importance of taking nothing for granted and being willing to consider counterintuitive solutions to hard problems.