non-RDBMS distributed databases, map/reduce, key/value and cloud computing

I've been playing recently with several distributed databases with the aim of choosing the best solution for my needs. Since there isn't much documentation on the web giving a general overview of the subject, I'm writing down some comments, thoughts and my humble experience here. I hope it's useful for you. This document is not a performance comparison or a "mine is bigger than yours" contest, just some ideas ;)

My background
Since my experience is based on relational databases (like MySQL or Postgres) and object-oriented databases (like ZODB), it was very easy for me to get hooked on this new challenge: anything but relational databases (no offense). Last year in Mainz, Germany, after I had heard a lot of buzz on the subject, Jan Lehnardt gave an interesting lecture on CouchDB (BTW, thanks Jan for all the questions you answered for me). That event was the firestarter, and since then I've been trying things... so I am still a newbie in this expert area.

Now, I am working on a project that performs WRITES on a very large set of data. This data doesn't need to live in a relational database: it's a simple activity collector, flat and with a varying structure, that writes several gigabytes every day which must be processed at night under different criteria.

I started doing some research on what I could use to store that data, and here are some of the things I tried. The requirements were:

  1. Horizontally scalable
  2. Able to handle and process a lot of non-trivial data
  3. Able to write at least 1000 reqs/s, non-bulk
  4. Open source

(I still have a decision to make)

Tokyo Cabinet + Tokyo Tyrant (key/value)

Tokyo Cabinet is a library of routines for managing a database that, in conjunction with Tokyo Tyrant (the network interface), makes an excellent network database. Tokyo supports up to 8 exabytes of data and it's really fast both on writes and on reads. It is currently used at Japan's Mixi social network (more than 14 billion page views monthly). It's a key/value database.

From the sysadmin point of view it was a lovely product: easy to install, start, manage, back up and replicate. Everything is as easy as they state on the website, and it works. If you, like me, are a memcached lover, Tokyo is compatible with the memcached protocol, allowing you such delights as "persistent but volatile data". You can run the database memory-only or with persistence (if you want in-memory only, I think memcached is slightly faster for that purpose, but that's just my impression from a few tests; do your own or Google it).
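Just to illustrate that memcached compatibility, here is a minimal sketch that speaks the plain memcached text protocol to a ttserver listening on port 1978 (the key, value and port are only examples):

# "my_value" is 8 bytes long, hence the 8 in the set command
printf 'set my_key 0 0 8\r\nmy_value\r\nget my_key\r\nquit\r\n' | nc 127.0.0.1 1978

If all is well you get STORED back, followed by the value for the get, exactly as memcached itself would answer.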

Starting a master-slave system is as easy as this (you might want only a master, or several slaves):

Start the Master: ttserver -port 1978 -sid 1 casket-1.tch

Start the Slave: ttserver -port 1979 -sid 2 -mhost localhost -mport 1978 -rts 2.rts casket-2.tch

Here the casket files are the databases, and their extensions define which storage algorithm you want to use. Easy as hell. Just be sure to set a different, unique identifier (sid) for every server and to tell the slaves who the master host is (mhost). That would run a master and a slave on the same machine on ports 1978 and 1979.
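A quick way to check that the replication is actually working, reusing the two servers started above (the key and value are made up): write on the master port and read from the slave port.

tcrmgr put -port 1978 localhost "my_key" "my_value"
# an instant later the slave should return the same value
tcrmgr get -port 1979 localhost "my_key"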

OK, that was really exciting for me, but I found that, although Tokyo is wonderful and everything, as with memcached you need to know the KEY of the element you want to retrieve (yeah, that's what a key/value store is supposed to do). In my case I need to map/reduce a vast amount of data, so Tokyo is not an option for me, but it might be what you've been looking for. I'll surely use it for some other project.

The weak point I found is that when used through the HTTP wrapper the performance goes down significantly. For instance, I had much better results using curl POSTs with CouchDB on a single node. Oh, and I hope you speak Japanese, because you won't find much documentation in English ;)

A basic example of how to write data to Tokyo using the HTTP wrapper:

curl -s -X PUT -d "my_value" "http://127.0.0.1:1978/my_key"

But the fastest way is using the native command-line tools:

tcrmgr put -port 1978 127.0.0.1 "my_key" "my_value"

You can list the contents using:

tcrmgr list -port 1978 localhost
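And reading a value back over the HTTP wrapper is just a GET on the key:

curl -s "http://127.0.0.1:1978/my_key"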

Here I link a presentation so you can see the whole picture: Tokyo Cabinet and Tokyo Tyrant Presentation

CouchDB (document oriented database)

CouchDB is a schema-free, document-oriented database. It's a wonderful Apache project with a bigger community behind it than the others. CouchDB exposes a RESTful HTTP/JSON API, meaning that with simple HTML and JS you can create a dynamic website. Of course you can use other wrappers like PHPillow if you want to use PHP.

The default query language of the database is JavaScript. You write map/reduce functions that are stored as views to get only the content you are interested in, and your docs are fully indexable. Even if you are pretty used to SQL, do give CouchDB a chance.
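Just as a sketch of how a view looks (the mynewdb database and the my_key field reuse the insertion example further down, they're not from my real project), you store map and reduce functions in a design document and then query it:

curl -s -X PUT 'http://localhost:5984/mynewdb/_design/stats' -H "Content-Type: application/json" -d '{ "views": { "by_key": { "map": "function(doc) { if (doc.my_key) emit(doc.my_key, 1); }", "reduce": "function(keys, values) { return sum(values); }" } } }'

curl -s 'http://localhost:5984/mynewdb/_design/stats/_view/by_key?group=true'

The second call returns how many documents share each my_key value; depending on your CouchDB version the view URL may be /mynewdb/_view/stats/by_key instead.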

Amongst all these systems it is the one I like most as a product. Although it doesn't ship a daemon for managing the service (I tried 0.9 from a checkout), it is very easy to install and run.

Some things I really like:

  1. No server-side scripting language needed
  2. The replication is bi-directional, peer-based
  3. Leave your nodes offline for a while, reconnect them, and it's as if nothing had happened.
  4. The Futon interface, some sort of, let's say, phpMyAdmin where you can play with and test your views

A basic sample of insertion using curl:

curl -s -X POST 'http://localhost:5984/mynewdb/' -H "Content-Type: application/json" -d '{ "my_key": "my_value" }'
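CouchDB answers that POST with a small JSON body containing the generated id and revision of the new document, and you can read the document back with a plain GET on that id (the id below is obviously made up):

curl -s 'http://localhost:5984/mynewdb/0329f8a8a2a803521c255b6f4e000fe1'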

If you want to insert a lot of documents you should do it in bulk mode. It's faster and your database will be more compact. Do not run the single-document test above for a million iterations or your disk space usage will grow a lot.
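Bulk mode is just a POST to the _bulk_docs resource with an array of documents (same mynewdb database as above):

curl -s -X POST 'http://localhost:5984/mynewdb/_bulk_docs' -H "Content-Type: application/json" -d '{ "docs": [ { "my_key": "value1" }, { "my_key": "value2" }, { "my_key": "value3" } ] }'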

Here is a link for the whole picture: What is CouchDB

MongoDB (document-oriented)

MongoDB is, like CouchDB, another document-oriented database. The project was born this very year, 2009, so it's very young. A colleague and I tried to install the server and run it with the PHP libraries under CentOS, but it was completely impossible.

Despite that, I'll keep an eye on it; it's very interesting as well, because:

  1. You can filter data by fields without knowing the keys (as in CouchDB); see the sketch after this list
  2. You can create indexes to accelerate the data filtering
  3. Sexy functionalities to operate with the database
  4. Native drivers ready for C, Ruby or Python, and a PHP module available.
  5. Not much of a learning curve
  6. Replication, failover and auto-sharding coming soon.
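Since we never got our installation running, this is only a sketch of how the first two points look from the mongo shell, going by the documentation (the database and collection names are invented):

mongo mydb
> db.activity.find({ "user": "foo" })       // filter by a field, no key needed
> db.activity.ensureIndex({ "user": 1 })    // index that field to speed up the filter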

Project Voldemort (key/value)

Behind this curious name from Harry Potter there is an interesting distributed key/value database written in Java. Project Voldemort handles replication and partitioning automatically: every server contains only a subset of the data, if a node fails your data integrity is not compromised, and there is no central point of coordination. Reads can be served from any node, but writes go to all of them at the same time, which can hurt your write performance. This system is used in some parts of the well-known LinkedIn, with 12 TB of data. But I am not a big fan of Java, so I played with it a little bit and, after going nuts running the examples and installing the Eclipse stuff, I tried something else; not for me :$ (not a lover of Tomcat either).

Again, another key/value. Doesn't fit me at this moment.

HBase (column-oriented)

HBase is a column-oriented database from the Hadoop project. I am using it right now through the HTTP interface and it's very fast at writing (the quickest in my tests). If you want to serve your website content from this database you'd better skip this one, because the latency is high. I jumped from here to Cloudera.

Some basics:

Create a table in HBase:


curl -H "Content-Type: text/xml" -H "Accept: text/xml" -v -X POST -T - http://ip.to.hbase:60050/api/

Then paste something like this and press Ctrl+D twice afterwards (the -T - makes curl read the body from standard input):


<?xml version="1.0" encoding="UTF-8"?>
<table>
  <name>table1</name>
  <columnfamilies>
    <columnfamily>
      <name>col1</name>
      <max-versions>2</max-versions>
      <compression>NONE</compression>
      <in-memory>false</in-memory>
      <block-cache>true</block-cache>
    </columnfamily>
    <columnfamily>
      <name>col2</name>
      <max-versions>2</max-versions>
      <compression>NONE</compression>
      <in-memory>false</in-memory>
      <block-cache>true</block-cache>
    </columnfamily>
  </columnfamilies>
</table>

After version 0.19 the JSON notation will be supported, so you will be able to do things like:

curl -H "Content-Type: application/json" -H "Accept: application/json" -v -X POST -T - http://ip.to.hbase:60050/api/

{"name":"test5", "column_families":[{ "name":"columnfam1", "bloomfilter":true, "time_to_live":10, "in_memory":false, "max_versions":2, "compression":"", "max_value_length":50, "block_cache_enabled":true } ]}


Show tables in HBase:
Any of these:

curl -v http://ip.to.hbase:60050/api/
curl -v -H "Accept: text/xml" -X GET http://ip.to.hbase:60050/api/

And to see the metadata of a table:

curl http://ip.to.hbase:60050/api/table1


Insert a row


curl -v -T row_contents.xml http://ip.to.hbase:60050/api/table1/row/


... where row_contents.xml has something similar to the XML code pasted above.

Other commands...

Disable/enable a table:

curl -v -X POST http://ip.to.hbase:60050/api/restest/disable
curl -v -X POST http://ip.to.hbase:60050/api/restest/enable

PUT:

curl -v -T y.row http://ip.to.hbase:60050/api/table1/row/y?column=a

GET:

curl -v  http://ip.to.hbase:60050/api/table1/row/y

SCAN

curl -v -T ./y.row http://ip.to.hbase:60050/api/table1/scanner?column=a:
# That returns an ID, like b959591, retrieve it:
curl -v -T ./y.row http://ip.to.hbase:60050/api/table1/scanner/b959591

Cloudera (Hadoop cluster)

Cloudera is a company that offers an Apache-licensed, packaged solution based on the Hadoop family. If you are interested in cloud computing and have been following projects like Yahoo!'s Pig or Facebook's Hive, then you'll be glad to know that Cloudera brings all that power to us 'mere mortals' in an easy RPM.

Cloudera combines the Hadoop cluster and the HDFS filesystem with Hive, and takes away all the pain of configuring a Hadoop cluster yourself. You will be able to summarise, analyze and query a huge amount of data. Have a look at the latest training session videos to get an idea of what the SQL-like Hive query language looks like. You can even download a VMware virtual machine with the solution already running.
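Just to give a flavour of it, counting activity per action from the hive command line could look roughly like this (the activity table and its columns are invented for the example):

hive -e "SELECT action, COUNT(1) AS hits FROM activity GROUP BY action"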

I have recently been looking at this. It's a lot more complex than any of the other solutions, but since I'll need to store and process several gigabytes every day, this seems to be the way to go... :_/ sniff...

If you are interested in this, you can also have Hadoop running via Amazon Elastic MapReduce; billing is hourly and per machine.

More interesting projects:

The following projects are no less interesting; I just didn't have time to test them fully, and they aren't suitable for my current needs, but they are worth a mention anyway:

These are only a few ideas that may help you. Hope it helped :)