Apache HBase for the Win, Part 1: All the Ways in Which it’s Bad – and Why it’s Still Great

How we use Apache HBase to handle billions of monthly events coming from JavaScript code embedded in our clients’ websites.

VP of Technology, Dynamic Yield

Let me begin with a confession: a bit over a year ago I joined Dynamic Yield, which was then still a small start-up with a few desks and big dreams. The tech pioneers at our company had just started using HBase in place of the initial solutions, which had proved non-scalable, and so like any good boy I set out to read about the tool – but at first I just didn’t get it. By that time I already knew the basics of MongoDB and Redis, so I “got the idea” of document stores, key-value stores and how you could represent higher-level data structures with them à la Redis. With HBase, however, something just didn’t click.

To paraphrase Elaine on that infamous Seinfeld episode, I basically wondered aloud (as I often do): “HBase – what is it good for?!”. The somewhat-vague Wikipedia entry didn’t help either.

War: What is it Good For? (from the episode “The Marine Biologist”)

Some time has passed since then, though (in start-ups, a year should count as seven at least, I think). My purpose here is to share with you our realizations about what HBase *isn’t* (or isn’t yet), and then elaborate on what it does really well and how to use the tool effectively. One thing to note: there is a growing corpus of texts and presentations about HBase use cases, some of which describe impressive production clusters comprising hundreds or thousands of nodes. Our production HBase cluster is on the small side in comparison, fluctuating under 10 nodes. However, we do use it to handle billions of monthly events coming from JavaScript code embedded in our clients’ websites (such as nytimes.com). Thus, our realizations might fit a lot of smaller companies that do web-scale on the small(er) scale.

In future posts, I’ll address our production layout in more detail.

HBase Considered Harmful?

Let’s begin, then, with the nitpicking: all the places where HBase is still less than ideal vis-à-vis some other popular tools.

HBase as a Document Store

Unlike MongoDB, CouchDB or other document-oriented databases, HBase has no native concept of hierarchical data, sub-collections or indeed data types at all: everything is a byte array, with just the simple Bytes helper class coming to your aid. You might of course have a layer that maps arbitrary data objects to columns and vice-versa, handling complexity and types (see the Kiji project for example), but these solutions are never going to be as standard or as high-performing as tools built for the job, with their own native optimized data formats (such as BSON).
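To make the “everything is a byte array” point concrete, here is a minimal Python sketch of the bookkeeping that ends up in application code. It is not HBase client code: the function names are made up for illustration, and the fixed-width big-endian encoding is an assumption chosen to resemble what a helper like Bytes produces for a long.

```python
import struct

def encode_long(value: int) -> bytes:
    """Pack a signed 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">q", value)

def decode_long(raw: bytes) -> int:
    """Inverse of encode_long; raises struct.error unless raw is exactly 8 bytes."""
    return struct.unpack(">q", raw)[0]

def encode_string(value: str) -> bytes:
    """Strings are stored as plain UTF-8 bytes."""
    return value.encode("utf-8")

# The store only ever sees opaque bytes; nothing records which
# column holds which type, so the application has to remember.
cell_views = encode_long(42)
cell_name = encode_string("homepage")

# Decoding with the matching function round-trips the value...
views = decode_long(cell_views)

# ...but nothing stops a mismatched read: these 8 bytes of text
# decode without error into a meaningless number.
misread = decode_long(cell_name)
```

The last line is the interesting one: with no type information stored alongside the value, a wrong decoder fails silently rather than loudly, which is exactly the kind of thing a mapping layer has to guard against.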

Naturally, there are also no built-in easy finder methods that select a portion of data based on data types or any internal hierarchy of data (though you could use Hive, or now Apache Phoenix, as an added SQL-like layer on top). Using HBase alone, trying to run ad-hoc queries over your data is just not fun, nor is it simple.

Moreover, it seems the KeyValue format employed by HBase includes a copy of the full row-key for each and every column-value pair, a fact which can be considered quite wasteful when you have a ton of columns (but at least it also lends itself to quite good compression ratios). While I’m no HBase internals expert, I’d guess the row-key duplication plus long column qualifiers would have a negative effect on cache performance and network payload size.
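A rough back-of-the-envelope estimate shows how quickly that duplication adds up. The sketch below assumes the classic uncompressed KeyValue layout as I understand it (fixed-size length fields, timestamp and type byte around the row key, family, qualifier and value); the sample row, the single-letter family name and the helper functions are all hypothetical, and compression or block encoding would change the real on-disk numbers.

```python
# Fixed per-cell fields in the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

def keyvalue_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate serialized size of one cell, before any compression."""
    return FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + len(value)

def row_size(row_key: bytes, family: bytes, cells: dict) -> int:
    """Sum over all cells of a row; the row key is counted once per cell."""
    return sum(keyvalue_size(row_key, family, q, v) for q, v in cells.items())

# Hypothetical row: three small values under a 21-byte row key.
row_key = b"user:0123456789abcdef"
cells = {
    b"page_views": b"\x00" * 8,
    b"last_seen": b"\x00" * 8,
    b"country": b"US",
}

total = row_size(row_key, b"d", cells)
payload = sum(len(v) for v in cells.values())
print(f"{payload} bytes of values stored in roughly {total} bytes of KeyValues")
```

This is also why short column family names and short qualifiers are commonly recommended: they are repeated in every cell, just like the row key.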

Of course, document-oriented databases also need to make some tradeoffs of their own, such as having to store some extra data bytes per record, to gain the ability to do partial updates of data in-place. This is something a very storage-efficient format like MessagePack cannot do.

HBase as a Key-Value Store

Yes, HBase is indeed a (row key, column qualifier)->value store. However, it is definitely not meant to compare with the likes of memcached or Redis in terms of pure throughput in this ruthless ops/sec-oriented world. Sure, plain-vanilla memcached doesn’t do any persistence or clustering for you, but consider Redis: it does configurable persistence, it offers a lot of built-in constructs (hashes, sets, lists, even a HyperLogLog now!) and should soon offer out-of-the-box clustering, and of course open-source solutions for sharding Redis have already been out there for some time. As for the value part of the equation in Redis, that value could be encoded either as hierarchical JSON or in the more compact MessagePack format, both supported natively in Redis’ Lua-based scripting engine.

Redis *could* be your main DB for some types of data, though the inherent limits of RAM size and the lack of almost any access control or data scanning/filtering commands still leave Redis as a great match for some use-cases, and seriously inadequate for others. In fact, it is a main pillar of the Redis philosophy to support its core set of features really well, instead of trying to offer something for everyone (like commercial DB vendors sometimes feel compelled to do).

However, assuming we don’t expect HBase to ever be as fast as a big in-memory hash-table in the sky, how performant could we expect it to be? Here, the main factors are the complexity of the read path and cache efficacy. If a data block is not already cached inside HBase’s region-server heap, then there was traditionally the cost of opening a connection to the underlying Hadoop data-node (hopefully, a local data-node) and then reading a block of data from HDFS, transferring it over TCP/IP, parsing it and looking for the wanted row-key in it, which might not be there at all.

Even though I’m in the middle of my rant section here, I should say right away that nice strides have already been made both in the in-process caching capability of HBase, and in using more direct paths to read from the HDFS file system. However, it’s an open question whether a random-access solution based on an inherently batch-oriented storage mechanism, such as HDFS, can compete favorably with other more single-minded solutions (see for example here). As the nature of Hadoop itself now changes rapidly, I’m personally staying tuned to see what performance improvements come next.

Automatic Mega-Scalability

As an implementation of Google’s BigTable, HBase could be expected to be pretty big on the “Big” side of things – and it is. It can grow to be really, REALLY big. However, it shares the same drawback as BigTable: its growth is hardly magical, and you need to work on each and every table to make it truly scalable. What I mean, of course, is the issue of row-key design: since row-keys are sorted in lexicographic order, the naïve row-key designs, which make the most sense to implement and use for scanning, oftentimes turn out to be the most inefficient ones. The usual example is time-stamped data: if the timestamp is the first component of your row-keys, it is super easy to scan for ranges of time, but all new inserts go at the end of the table into a single “hot” region handled by just one region-server at a time. The rest of your nice cluster just sits there wasting compute hours.

For that reason, the web is full of discussions and ideas on proper key design, mostly revolving around adding salts/prefixes to your row-keys (usually based on a consistent and well-distributed hash of the rest of your key). In other words, there’s a lot of work to do in order to balance between good distribution of data and efficient reads (the proper solution always depending on your specific write and read scenarios), and there are trade-offs to every design.
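As one illustration of that trade-off, here is a minimal Python sketch of a salted, time-stamped row-key. Everything in it is an assumption made for the example rather than a recommended design: the bucket count, the `|` separator, the use of MD5 over a hypothetical entity id, and the function names.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; often sized to the number of regions or servers

def salted_key(timestamp_ms: int, entity_id: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Build bucket|timestamp|entity, with the bucket taken from a stable hash."""
    entity = entity_id.encode("utf-8")
    bucket = hashlib.md5(entity).digest()[0] % buckets
    # Zero-padding keeps lexicographic order equal to numeric order.
    return b"%02d|%013d|%s" % (bucket, timestamp_ms, entity)

def scan_ranges(start_ms: int, end_ms: int, buckets: int = NUM_BUCKETS) -> list:
    """One (start_row, stop_row) pair per bucket; stop_row is exclusive."""
    return [
        (b"%02d|%013d" % (b, start_ms), b"%02d|%013d" % (b, end_ms))
        for b in range(buckets)
    ]

# Writes arriving at the same instant now spread over many key prefixes
# instead of all landing at the tail of the table.
now_ms = 1400000000000
keys = [salted_key(now_ms, f"user-{i}") for i in range(1000)]
prefixes_used = {key[:2] for key in keys}

# The price: a single time-range read becomes one scan per bucket.
ranges = scan_ranges(now_ms, now_ms + 60_000)
```

Writes are spread across `NUM_BUCKETS` prefixes, but a query for “the last minute of events” must now issue `NUM_BUCKETS` scans and merge the results, which is the balancing act between distribution and read efficiency described above.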

One other thing to note is that scalability of course does not make for a magically super-robust solution: the “Mean Time To Recovery” (MTTR) in case of unexpected region-server failures has been less than stellar in the past, but there’s work being done on that front as well.

HBase as a MapReduce Input and Output

As a long-time member of the Apache Hadoop eco-system, HBase should, you would think, at least have first-class integration with Hadoop – primarily, as an input or output for MapReduce jobs. However, the support in place has long felt like something of a patch. Ideally, feeding raw data to your mappers should involve no overhead beyond the minimal work of directly reading and parsing source files. In the HBase case, however, you actually need to go through the full read path via your region-servers – and this adds a lot of overhead. Worse, it puts extra stress on your online production HBase cluster for the duration of your jobs. The TableInputFormat helper class in HBase is only a thin wrapper over normal HBase scans, not a magical short-circuit path to the raw key-values down below. Effectively, that’s like wrapping a query to any external DB as an input source, the only added value being that the MR job runs on the same node as the region server (and hopefully, also the data files).

However, again – the scene is now changing for the better.

And now, for the Usual Rants About Java

The days of great PR for the JVM as a platform appear to be long over. With everyone praising V8 and the like, many techies would have only two things coming to mind when thinking about Java and its underlying engine: a lot of verbose boilerplate code (true for Java, at least), and a lot of Garbage Collection activity. In reality, I think oftentimes the GC rants are a bit unfair: people often forget that the actual workloads they put on Java servers tend to be much higher than what their nice Node.js-based server actually does (in terms of actual data crunching, not just acting as a router calling other services asynchronously).

However, with HBase, the GC worries are dead-on justified. With a lot of small writes coming in, stored in memory and flushed, and then a big in-heap cache, you might experience frequent pauses of approximately 2 to 10 seconds each, which makes read times unstable and unpredictable. In fact, that’s one of the implicit claims made by MapR in favor of their native code-based alternative to HBase: you can’t really compare a Java-based solution to a well-optimized native one when you need a well-controlled, well-understood level of service.

MapR M7 Edition read latency histogram, as compared to Apache HBase by MapR

There are some battle-tested hints for GC optimization (here’s one concise recipe), which usually trade increased overall CPU activity for shorter stop-the-world pauses, but don’t expect any miracles.

HBase for the Win, Still?

All that being said, the fact is that I’ve become a big proponent of HBase. It has given us tremendous scale for our operations, and, while it is neither the simplest turn-key solution you might dream of nor the fastest for every scenario, it can actually carry a lot of weight even on smaller clusters.

In fact, if you look at what’s been happening recently for HBase 0.96, 0.98 and beyond, you’d see that big gains have been made regarding all the issues I’ve addressed and many more.

  • Performance is being improved on all fronts with smarter caching (for example here), shorter read paths to HDFS data and more. Load-balancing of regions between servers has been made smarter and the effect of background compactions on overall performance is being reduced.
  • Support for table snapshots allows for reading from raw files directly in MapReduce jobs, greatly improving performance and taking the edge off your online region-servers. Snapshots also simplify backups, and there seem to be a lot of features down the pipeline to expand what you could do with them.

Performance of MapReduce over HBase table snapshots, compared to normal scanning via RegionServer

  • With tools like Phoenix, an SQL database-like experience at more than reasonable speed is now possible, reducing the need to write cumbersome code just to get to your data.
  • Robustness is improved with more attention now being given to recovery times, more comprehensive testing focused on cluster stability and a steady pace of bug fix releases.

Overall, I see a very vibrant community, which seems to just grow stronger: take a look at the lineup for the recently wrapped-up HBaseCon 2014. There’s incredible momentum building up all across the Hadoop eco-system, so much so that, in fact, it’s getting hard to keep track of all those incubator projects and where they fit into the big picture. HBase seems to be one of those few components that are not only already well-established, but also actively being worked on. My underlying message here is: just don’t expect any silver bullet. If you know what you’re doing (which you must, if you hope to build any system at scale), HBase might do the job really well.

In my next post, titled “HBase for the Win, Part II: 4 In-Depth Tips for Pulling Less Hair Out”, I dive deeper into the effective use of HBase along with other accompanying tools that might help you build that buzzword-worthy production beast you need.