Turning messy data into a gold mine using Spark, Flink, and ScyllaDB
With our user's data spread out across numerous locations, a set of real challenges emerged. Find out how we set out to unify it, along with all the technical decisions we made and new opportunities it unlocked.
At Dynamic Yield, we eat, sleep, and breathe data – it is the bedrock of our personalization technology. Our R&D department is constantly working to ensure all of the valuable information we capture through the platform fuels the development of new products and features.
Recently, we faced a real set of challenges with regards to how we stored our user’s data. Seeing as every dev team works on different features and leverages the data for their own unique purposes, naturally, portions of it would end up stored in different databases – wherever was most suitable based on the team’s technical needs. However, with data spread across numerous locations, it became more and more difficult to see the big picture. Developers began struggling to analyze the data and understand simple questions about our users. And in turn, it was harder for them to develop features based on it.
To solve this, we set out to unify it – all of it – made possible with the help of Apache Spark, Apache Flink, and ScyllaDB. In the rest of this post, I will walk through all the technical decisions we made to establish a unified database, which has not only unlocked new opportunities for our department as a whole, but also our customers.
Our attempt at unifying user data into one place
Although we didn’t tackle everything from the onset, we did want to plan on addressing these four main technical issues to effectively accomplish our goal:
- Defining internal and customer needs
- Choosing a database (DB) which is capable of supporting our solution
- Migrating historical data into the DB
- Streaming live events without any gaps between historical and streamed data
Choosing a database
Next, a few core requirements needed to be ironed out.
Should store large amounts of data – We collect a lot of data about users and have been doing so for several years now, so we needed an option that would not only be able to handle our current database but also scale as it becomes larger.
Read latency must be very low – When users come back to our customers’ sites, we quickly analyze everything we know about them to offer a personalized experience. This process of analysis and delivery has to happen almost immediately in order for the web page to load within a reasonable amount of time.
Needs to contain both historical and real-time data – We analyze user data and for us to make accurate decisions based on it, the data has to be up-to-date. Therefore, both historical user data, as well as current activity, should be taken into account.
Flexible query support – Today we may be interested in only certain pieces of data, but tomorrow we might be interested in another portion of it. Proper query support was needed to allow us this kind of flexibility.
Why we chose ScyllaDB
ScyllaDB is one of the newest NoSQL databases out there. Described as an Apache Cassandra replacement with ultra-low latency and extreme throughput, with most of Cassandra’s features implemented in C++, it’s able to present very impressive benchmarks.
Additionally, it shared other similarities with Apache Cassandra we liked:
CQL (Cassandra Query Language) support – Using CQL, we could easily write queries to fetch the data we needed.
Easy integration with Apache Spark – When writing our Spark applications, we could use the ‘spark-cassandra-connector’ in order to write and read from Scylla.
Its low read latency meant we could confidently use it for serving experiences. And because it’s not an in-memory database, it wouldn’t be as expensive. As I mentioned before, we needed to store a lot of user data, and given the number of nodes we would need, this was a big factor for us.
From a safety point of view, Scylla supports data replication according to the replication strategy of your choosing. So if one Scylla node crashes, there’s no fear of losing the data.
After we got our DB up and running, it was time to fill it with some data.
Migrating old data to the new DB -Using Apache Spark we were able to migrate all of our data to the new Scylla DB very quickly. The spark application was pretty simple as it only needed a start and end date to get started. It then scanned the old ‘raw data’ within the time range allotted and wrote it to Scylla. This ‘raw data’ functions as our big data store – great for batch jobs, but not for random access.
*Note* we only had to run the spark application once in order to migrate all data to into Scylla DB. If we ever need to migrate data into Scylla again, we will be able to use the same application. In fact, we have used this capability more than once when developing.
Streaming online events to Scylla -As previously mentioned, one of our requirements was that Dynamic Yield’s Unified Customer Profile would always contain up-to-date data. For this purpose, we used Apache Flink – our application receives streams of user events as they occur from our Kafka topics, applies some transformations to the data to fit our schema, and then writes them to Scylla.
Not all data is created equal. For example, a purchase event can provide a lot more insight into a user’s preferences than a simple page view. With a large percentage of our data representing page views, we knew there wasn’t a need to store them forever.
Using a table-based format for our data modeling, we save all user events in one table. Events like purchases or add-to-carts are never deleted, while page views are stored for a three month period. Rather than saving these page views, what we do is actually extract ‘aggregated data’ from them.
We store this aggregated data, which can also include a user’s location, browsers, devices, etc., in another table. If we already know a user has visited a web page from the U.S., there is no need to save the information again when they revisit the web page from the same location and so on.
The merits of our unified customer profile
Having all of our data in one place has opened many doors for us, aiding our understanding of each user who interacts with an experience launched through the Dynamic Yield platform in a much more efficient way. And through the work we’ve done, we are now able to bestow this knowledge on our customers, who can use the rich data from what we refer to as our Unified Customer Profile (UCP) to:
- Provide tailored content every time a user revisits a site or app
- Leverage machine learning algorithms to match each user with the optimal experience (based on both long-term history and in-session behavior)
- Develop custom individualized experiences through server-side implementation
Not to mention, meeting the requirements of GDPR became significantly easier when all of our user data was unified, another huge bonus of our efforts.
Have questions about the process we went through or want to share stories about how you unified your company’s data? Comment below!