Going for gold: The building blocks of effective open-source software
Read our VP of Technology's perspective on how his experiences in the field have pushed him and others to innovate and tackle challenges facing developers across industries, culminating in fresh open-source projects.
This post is part one of a two-part series brought to you by DY Labs, an initiative by members of our Product and R&D departments who have a passion for experimentation and building developer resources for the greater digital marketing and engineering communities. This series focuses on the cohort’s latest project, Funnel Rocket, which is a serverless query engine optimized for complex user-centric queries at scale. In part one, Elad Rosenheim, our VP of Technology, walks through his journey in the world of product and engineering and how his experiences in the field have pushed him and others to innovate and tackle challenges facing developers across industries in order to develop open-source solutions.
An earlier version of this post was previously published by Elad on Medium.
Over the last decade, we’ve experienced the rise and rise of open-source software for big data and big scale. That was indeed a big leap forward, yet I’ve become disillusioned with much of the surrounding hype. Rants aside, here’s where I think the underlying issues are, and most importantly – what we can do about it. It’s my personal take, so let’s start with my own personal story.
From the dot-com bubble to the open-source era
I can trace my first real job back to around 1998. During that period, real servers were still made by companies like Sun, running Solaris on expensive proprietary metal. I recall catching a glimpse of a purchase order for a few Sun Ultra workstations to roughly around 2000. One line item, in particular, caught my eyes, as I could directly compare it to my PC at home: a cool $150 for a CD drive, worth about $230 today. It’s very probable that this drive came in Sun’s iconic purple, thus exuding its commanding authority over the pedestrian CD drive in your home, but still — it didn’t feel right. Savvy young developers started pushing for Linux, and people rushed to join the first wave of the dot-com bubble.
Fast forward to 2013, when I joined a tiny startup – Dynamic Yield – and started working on “web-scale” problems in earnest. Up to that point, scale (to me) usually meant tuning a beastly Java application server doing maybe 100 requests per second at most, each request typically hitting the relational database a dozen times within a transaction. Every single request used to mean something important that someone cared about. Now, data has become a clickstream from across the web: hardly transaction-worthy, but coming in at thousands of requests per second, and growing fast.
The founding team already had their first database (a single MySQL instance) brought to its knees on that traffic and moved on to Apache HBase for anything non-administrative. Before a year went by, I introduced Redis for fast access to user data and Elasticsearch for ad-hoc analytics to our growing spiderweb of an architecture. We were deep in NoSQL territory now. For batch jobs, custom Java processes and clunky Hadoop MapReduce jobs started making way for the new shiny tech of the day: Apache Spark. Kafka and stream processing frameworks would come next.
Much of that new wealth of tools for scale was directly inspired by Google’s whitepapers. These marked a big shift in how scale and resilience were tackled or at least thought about.
First, it was about GFS (the Google File System) and MapReduce, then BigTable. It pushed the wider tech community to think in terms of commodity hardware and inexpensive hard disks rather than purple hardware and RAID, relying on multiple replicas for high availability and distributing work. You realized hardware will fail and keep failing, and what you needed were tools that will happily skip over such road bumps with minimal fuss and grow in capacity as fast as you can throw new machines at ’em.
Not having to pay a hefty per-server license fee was a big plus. The concept of GFS inspired Apache HDFS, Google MapReduce’s principles were recast as Apache Hadoop M/R, and BigTable was remade as Apache HBase. Later on, Google’s Dremel (which you probably know today in its SaaS incarnation — BigQuery) inspired the Parquet file format and a new generation of distributed query engines.
To me, these were exciting times – being able to scale so much with open source, provisioning new servers in hours or minutes! The promises of elasticity, cost efficiency, and high availability were, to a large extent, realized — especially if you had been accustomed to waiting months for servers, for IT priests to install a pricey NAS, or for the procurement of more commercial middleware licenses.
I guess this goes against the grain of nostalgia permeating so much of the discourse. Usually, it sounds something like, “We did so much in the old days with so little memory and no-nonsense, hand-polished code! Oh, these kids and their multi-megabyte JS bundles will never know…” I like to compare that to a tip I’ve learned from parenting: There is always someone who has achieved so much more with far less but is simply eager to tell you all about it.
The burden of complexity
It seems like you only get a look at the real, gnarly underbelly of a cluster at the worst times: during peak traffic periods, at nights, or on holidays.
There is a complexity to a cluster’s state management that just does not go away, and instead, is waiting to explode at some point. Here’s how it usually starts off, at least in my experience:
- Whether you’re managing a random-access database cluster, a batch processing, or a stream processing cluster, you get it to work initially and for a while, it seems to work!
- You don’t really need to know how it elects a master, how it allocates and deallocates shards, how precious metadata is kept in sync, or what ZooKeeper actually does in there.
At some point, you hit some unexpected threshold that’s really hard to predict because it’s related to your usage patterns, your specific load spikes, and things go awry. Sometimes, the system gets completely stuck, sometimes limping along and crying out loud in the log files. Often, it’s not a question of data size or volume as you might expect, but a more obscure limit being crossed. Your Elasticsearch cluster might be fine with 2,382 indices, but one day, you get to 2,407 and nodes start breaking, pulling the rest of them down with them to misery lane. I just made up these numbers —your metrics and thresholds are likely different, but you get the picture.
In the best-case scenario, you solve the issue at hand that day, but often, the same issue will repeat. This is where you need to step back and give it time. Sometimes, it takes weeks to stabilize things or even months to hunt down a recurring issue. And over time, this becomes a time suck. Sometimes, multiple and seemingly unrelated incidents happen to all blow up over a short period of time, and the team experiences fatigue from fighting multiple fires simultaneously. Eventually, you’ve put enough work hours and thrown enough extra resources at the problem (i.e. supplying a few more servers, giving it more headroom) and it goes away, but deep down, you know it will return. As a head of R&D, I know this can be extremely draining, because whether the fire is at team A, B, or C — it’s ultimately your fire as well, every single time.
Sometimes, people will respond to your troubles with how great and reliable their tooling is or was at company X: “We’re using this awesome database Y and have never had any significant issue!” or “You’re over-complicating it, just use Z like us and your problems will go away.” However, this can often be over-simplifying the core issue at hand, skirting around the real complexity and nuances of your end goal. And over time, I’ve found solace in focusing on tending to my roots, because choosing to neglect them is only sustainable for so long.
Let’s talk about elasticity
Operational complexity is an enduring burden, but I should also mention limited elasticity and its friend, high cost. While tools do get better in that respect, what may have been considered “elastic” in 2012 is simply not what I’d call elastic today.
Let’s take Apache Spark for example. There’s still a steady stream of people going into Big Data who think it will solve all their problems, but if you’ve spent significant time with it, you know that in order for your jobs to work well, you need to carefully adjust the amount of RAM vs. CPU used, you need to dive into config settings, and perhaps even tinker with Garbage Collection a bit. You need to analyze the “naive” DAG it generates for your code to find choke points, then modify the code so it’s less about the cleanest functional abstraction and more about actual efficiency. In our case, we also needed to override some classes. And that’s how this song goes: care-free big data processing by high-level abstractions rarely lives up to the hype.
The challenge goes further, however.
One idea that Apache Hadoop pushed for since its inception was the ability to build a nice big cluster of commodity servers in order to throw several jobs from different teams at different times at its resource manager (nowadays known as YARN), letting it figure out which resources each job needs and how to fit all these jobs nicely. The concept was to think about the capacity of the cluster as an aggregate of its resources: this much CPU, storage, memory. You would scale that out as necessary, while R&D teams kept churning out new jobs to submit. That idea pretty much made its way into Spark as well.
The problem is, one single cluster really doesn’t like juggling multiple jobs with very different needs in terms of computing, memory, and storage. What you get instead is a constant battle for resources, delays, and occasional failures. Your cluster will typically be in one of two modes:
- Over-provisioned to have room for extra work (read: idle resources that you pay for), or
- Under-provisioned, leaving you wishing it was easy or quick to scale depending on how well jobs progress right at this moment (I’m not saying it’s not possible, but it’s definitely not easy, quick or out-of-the-box)
Web servers usually “have one job” and give you the ability to measure their maximum capacity as X requests per second. They rely on databases and backend servers, but they don’t need to know each other, let alone pass huge chunks of data between them over the network or thrash the disks. Data processing clusters do all these different things, concurrently, and never exactly at the same exact moment, which may cause issues. If you run into such problems, you’ll probably start managing multiple clusters, each configured to its needs and with its own headroom. However, provisioning these multiple clusters takes precious time, and operating them is still a hassle. Even a single job in isolation has multiple stages which stress different resource types (CPU, memory, disk, network), which makes optimal resource planning a challenge, even with multiple dedicated clusters.
You could get some of that work off your shoulders if you go for a managed service, but the “management tax” can easily cost outlandish amounts of money as you scale. If you’re already paying, say, $1 million a year for self-managed resources on the cloud, you have to ask yourself: Are we willing to pay 1.5x-2x that price tag to receive a managed-to-an-extent service?
New technologies (e.g. k8s operators) do provide us with faster provisioning and better resource utilization. There is one thing they cannot solve for, however: precious engineering time spent to thoroughly profile and tune misbehaving components, which in turn makes it very tempting to throw ever more compute resources at the problem. Over time, these inefficiencies accumulate, and as the organization grows, you’ll rack up a pretty significant bill to manage the entire system without an efficient solution.
Gradual improvements over time
Once your Spark cluster is properly set-up and running well, it can output huge amounts of data quickly when it reaches the result-writing stage. If the cluster is running multiple jobs, these writes come in periodic bursts.
Now, assume you want these results written into external systems, e.g. SQL databases, Redis, Elasticsearch, Cassandra. It’s all too easy for Spark to overwhelm or significantly impair any database with these big writes – I’ve even seen it break a cluster’s internal replication, which is a nightmare scenario.
You can’t really expect Elasticsearch to grow from, say, 200 cores to 1,000 for exactly the duration where you need to index things in bulk, then shrink back to 200 immediately afterwards. Instead, there are various things you can do:
- Aggressively throttle down the output from Spark — i.e. spend money being near-idle on the Spark side
- Write and manage a different component to read Spark’s output and perform the indexing (meaning dev time and operations)
- Over-provision Elasticsearch (= more money)
In other words, not really the elasticity I was hoping for.
Over time, patterns have evolved to alleviate some of these pain: Apache Kafka is frequently used as the “great decoupler”, allowing multiple receiving ends to consume data at their own pace, go down for maintenance, etc. Kafka, though, is another expensive piece of the puzzle that is definitely not as easy to scale or as resilient as the initial hype had me believing.
On the storage front, there have been improvements as well: instead of using HDFS, we switched to S3 for all batch outputs so we don’t need to worry about HDD size or filesystem metadata processes. That means giving up on the once-touted idea of “data locality” which, in hindsight, was a big mismatch. Storage size tends to go only one way: up. At the same time, you want the compute to be as elastic as possible, utilizing as much or as little as you need right now. Marrying both was always quite clumsy, but fortunately, AWS got its act together over time, improved intra-datacenter network performance considerably, and finally sorted out strong consistency in S3 (oh, the pain, the workarounds!). That brought it in line with Google Cloud on these points, making the storage-compute decoupling viable on AWS as well.
That last note is important: the building blocks offered by a cloud provider may encourage a good architecture (or push you away from one).
Don’t forget to create a dev wishlist
This isn’t an article in the style of “Stop using X, just use Y.” I believe you can create a wildly successful product (or fail spectacularly) with any popular technology. The question I tend to focus on instead is: “What constructs do we need to make systems easier, faster, and cheaper?” Here’s my partial wishlist:
- I want to launch jobs with better isolation from each other – each getting the resources it needs to run unhindered, rather than needing to fight over resources from a very limited pool, avoiding any possible domino effect from a single job derailing.
- I want the needed resources to be (nearly) instantly available
- I only want to pay for the actual work done, from the millisecond my actual code started running to the millisecond it ended.
- I want to push the complexity of orchestrating hardware and software to battle-tested components whose focus is exactly that rather than any applicative logic. Using these lower-level constructs, I could build higher-level orchestration that is way more straightforward. Simple (rather than easy) is robust.
- I want jobs to run in multiple modes: it could be “serverless” for fast, easy scaling of bursty short-lived work and interactive user requests at the expense of a higher price per compute unit/second. Or, it could be utilizing spot / preemptive instances — somewhat slower to launch and scale, but very cheap for scheduled bulk workloads.
I’m not inventing any new concepts here. The building blocks are essentially readily available these days. The open-source software to take advantage of all these to the max, however, is not so available. To demonstrate what I mean, let’s tackle a real challenge guided by these principles. In part two of this series, I will dive into Funnel Rocket, an open-source query engine that is my attempt at building a solution. The goal was to build something to solve a very specific pain point we needed to address here, at Dynamic Yield, but over time, as we’ve spent more time working on it, I’ve come to realize how it can also become a testbed for so much more. So now, let’s move on to part two.