Doing Data Science in a Startup: The Hard Truth
I hate to break it to you, but a high-tech Internet startup is not a natural environment to do research. Most startups come into existence around a very applicable and practical idea (hopefully), one that either requires no scientific research or rests on core research the founders had already done before the startup came to be. However, there are a number of advantages that can make startups a much more attractive working experience than classic academic-style research.
As an always-careful scientist, I should also start with a disclaimer: I am writing from my own personal experience, which is definitely not a statistically significant sample size :-). So far, during my career, I have experienced research in academia, in a government organization and now here at Dynamic Yield. Knowing that startups often keep a single data scientist on staff, my goal for this post is to pass on my personal experience: the thoughts and challenges I’m facing on a daily basis, with the hope that some of you fellow data scientists out there could relate to it.
When taking a data science job at a new workplace, we often change the domain from which our big data is generated, and a lot of time goes into building our domain-specific knowledge and intuition. I came into Dynamic Yield about a year ago knowing almost nothing about the world of web content publishers, e-commerce, online marketing and advertising, or even what the heck a CTR is (perhaps some kind of medical procedure? I somehow assumed that the “R” stands for radiation). To be honest, I’m willing to bet that, on average, data scientists are much less social than marketers or professionals in almost any other field for that matter – so don’t be shy, talk with people in the field and try to learn as much as you can about your new domain. At the end of the day, these people will come to you with questions whose answers often hinge on their domain knowledge. Simply put, if you don’t know their domain – you can’t help them.
The first major difference between academic and startup-style research is of course time constraints. Even though mental processing often needs a long time span, I’m sorry but no, there’s simply no time to procrastinate for days and read PHD Comics. As so eloquently described in Uri Alon’s TED talk, scientific research often means being stuck in a cloud of uncertainty and confusion, and this is a much more stressful state when you’re in a startup and expected to deliver quickly. In a startup, time is the most limited and valuable resource.
For me, this means that I always try to avoid reinventing the wheel. I read through articles looking for solutions that have some resemblance to the problem I’m working on, and assess whether I can “fill in the gaps” in a reasonable amount of time to fit my own domain. When available, I use open source tools rather than implementing my favorite machine learning algorithms myself, as fun as that might be (sometimes, of course, there’s simply no open implementation of the desired algorithm). While on this topic, Python and its great toolset – NumPy, SciPy and scikit-learn – are my tools of choice whenever scale allows.
Another major difference lies in the kind of solution you’re expected to deliver, which usually has to obey a ton of requirements from Product and bend to a myriad of technical constraints. This whole affair is far, far removed from that pure and abstract problem you dream of working on while in academia. A classic example here would be the i.i.d. assumption about your sample: in real life, practical constraints often render the i.i.d. assumption simply wrong, but what should you do about it? I wish there were something you could do, but more often than not you find yourself merely making a mental note of it, carrying on with your business and hoping for the best. If your model doesn’t work, you’ll often have a million other things to try out, tweak and tune before you go back and try to tackle this core issue.
Now, for the fun part: when doing research at a startup, you usually have a quick path to production – a VERY quick path. If you’re lucky, the startup environment offers a resource you almost never have in academia: a great team of developers at the top of their game, ready to take your core idea, wrap it up and integrate it into the running system quickly and efficiently. At Dynamic Yield, we can go from idea through design and staging to production in just a few days, and becoming part of that process has personally taught me a lot. As an example, please check out this free Bayesian A/B Testing Calculator which we conceived, researched, coded and designed in under a week! This advantage should not be under-appreciated: for me, this is the key selling point when making the choice between academic and startup style research. Of course, with great power comes great responsibility: being the single data scientist in a startup, it’s solely up to you to lead the machine learning effort – and that’s a huge challenge and responsibility.
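To give a feel for the kind of computation a Bayesian A/B test involves, here is a minimal sketch in plain Python. It is my own illustration under stated assumptions, not a description of how the calculator mentioned above works: each variation gets a uniform Beta(1, 1) prior on its conversion rate, the two variations are treated as independent, and the function name and parameters are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A).

    Assumes independent Beta(1, 1) (uniform) priors on both conversion
    rates; this prior is an illustrative choice, not taken from the post.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, the posterior after observing the data
        # is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Hypothetical numbers: 50/1000 conversions for A, 65/1000 for B.
    print(prob_b_beats_a(50, 1000, 65, 1000))
```

The result reads as "the probability that B's true conversion rate is higher than A's, given the data and the prior", which is usually easier to explain to a marketer than a p-value.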
As the resident scientist, you sometimes have to be the voice of sanity. “So you’re telling me there’s a chance” is what I often hear when I try to put expectations into perspective. Many times, I’ve had to give a reality check when it comes to what can or cannot be done, what could be expected from a predictive engine, how little data might be considered statistically significant and various other kinds of wishful thinking. Whoever tells you their model can emit a good prediction from the moment of first conversion is lying either about their abilities or about having no prior data.
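One way to make the "how little data" conversation concrete is to show how wide the uncertainty around a measured rate really is. The sketch below computes a Wilson score interval for a proportion such as a click-through rate; the function name is hypothetical, and the choice of the Wilson interval with z = 1.96 (roughly 95% confidence) is my own assumption for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence (an illustrative default).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed 10% rate, very different amounts of evidence.
    print(wilson_interval(2, 20))      # 2 clicks out of 20 impressions
    print(wilson_interval(200, 2000))  # 200 clicks out of 2000 impressions
```

With 2 clicks out of 20, the interval spans roughly 3% to 30%; with 200 out of 2000, it narrows to roughly 9% to 11%. Both samples show "a 10% rate", but only one of them supports a decision.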
So, as you can see, startup and academic research are quite different in my view. This raises the question: should startup-style research be considered “research”? Frankly, I’m not sure, but here’s what I think: if your work involves combing through and assessing mountains of academic articles, then experimenting with merging your own dataset and ideas with the algorithms described, then research it is.
I want to finish on a somewhat-philosophical note. As much as I love machine learning and crunching big data, there is this part of me that feels it’s all a pity to some extent. Let me explain: as we’re entering the dawn of ML and BD (and clearly this is just the beginning), we are giving up our deep understanding of the underlying phenomena and processes in favor of our ability to predict future behaviors/events and make things more efficient. From Newton through Einstein to Schrödinger (pardon me for being a physicist at birth…), in the past few centuries, science has advanced humanity and taken center stage away from religion as it brought two gifts to us humans: it brought us understanding of the examined domain (nature) and, with it, came predictive ability. These two gifts were intertwined and completely dependent on each other. As a scientist at heart, I find that it is understanding the underlying phenomena that exposes their beauty and makes the craft worthwhile.
Nowadays, a great ML expert might develop amazingly powerful ML algorithms, capable of things we humans fall short of. However, that expert can still get by with very little real understanding of the underlying phenomena. Deep understanding and predictive ability are decoupling as we speak. Even the term “data science” itself, if interpreted as “the science of data,” removes itself from the underlying phenomena. Unfortunately, many, if not most, non-scientists would give up understanding in exchange for predictive ability any day of the week. I think this sucks. It is my hope that this is just an intermediate phase, and that eventually human curiosity will catch up and leverage these amazing new tools for a deeper understanding of nature and us Homo sapiens.