One of the hottest topics in software today is AI. And if you’re anything like me, it’s safe to assume you’re keeping up with all the recent trends in its development. A few years ago, you may have even seen a piece from The Netflix Tech Blog, where the streaming service discussed how it personalizes the artwork shown when recommending content to each user, based on their personal viewing history.
I found it particularly interesting, primarily because the algorithm described, and the whole workflow around it, were strikingly similar to what we used at Dynamic Yield in our product not long ago. I even gave a talk about our approach to the field of Contextual Bandits – a niche sub-domain of ML that is only rarely discussed.
But as it turns out, the algorithm was speculated to have been behind some racial targeting, and more recently, the internet lit up with criticism over African Americans being served alternate movie posters showing only black actors. Especially egregious, as people have noted, is the fact that the actors featured did not receive any significant screen time in the movie itself.
In the wake of the controversy, Netflix went on record claiming, “We don’t ask members for their race, gender, or ethnicity so we cannot use this information to personalize their individual Netflix experience. The only information we use is a member’s viewing history.”
Isn’t that distinction mostly semantic, though?
In the rest of this post, I’ll dive into the wider issue underlying these incidents and why those in the industry need to be more responsible when designing future algorithms.
The inconvenient, convenient truth
To many people, Machine Learning is shrouded in an aura of mystery. I typically see the topic portrayed as some mythical new force – its inner workings and boundaries unknown. This image benefits companies, who now promote their technology as a mysterious, complex marvel. It also allows them to dodge controversies with rather ambiguous responses, much like the case here with Netflix. But this isn’t the first time Machine Learning and skin color have become an issue of debate.
In fact, models for image labeling and classification trained on huge datasets are now being offered as-a-service. And it turns out these services are pretty darn accurate when classifying skin color and gender – except when it comes to black females.
Lower detection rate for darker females by Amazon Rekognition
In August of 2018, researchers found Amazon’s Rekognition AI software had much higher error rates when predicting the gender of darker-skinned women in images, compared to that of lighter-skinned men.
So what’s the reason for this weakness, and why should we be concerned?
One simple cause may be selection bias – an algorithm trained on only a few images of black females as opposed to many images of white people or of males. In a situation like this, an algorithm will usually tend to make only “safe bets,” predicting whatever it has seen most frequently. In other cases, it can be hard to obtain enough samples of each type we wish to classify in order to get it right. Alternatively, maybe the algorithm struggled due to lighting conditions or low dynamic range in the source images, which still aren’t handled well. Most likely, though, it had more to do with the dataset chosen as input for the model.
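As a tiny illustration of how selection bias produces those “safe bets,” consider a toy dataset with a 95/5 group imbalance and a naive model that always predicts the majority class. The groups, counts, and numbers below are all invented for the example:

```python
# Toy dataset: 95 samples from group "A", 5 from group "B".
# The label to predict is simply the group itself, mimicking a
# classifier that must recognize which group a sample belongs to.
dataset = ["A"] * 95 + ["B"] * 5

# A naive model that only makes "safe bets": it always predicts
# the class it saw most often during training.
majority_class = max(set(dataset), key=dataset.count)
predictions = [majority_class for _ in dataset]

overall_accuracy = sum(p == y for p, y in zip(predictions, dataset)) / len(dataset)
minority_accuracy = sum(
    p == y for p, y in zip(predictions, dataset) if y == "B"
) / dataset.count("B")

print(f"Overall accuracy: {overall_accuracy:.0%}")   # 95%
print(f"Accuracy on 'B':  {minority_accuracy:.0%}")  # 0%
```

The headline accuracy looks great while the minority group is misclassified every single time – exactly the failure mode an aggregate metric can hide.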
But as facial recognition technology is increasingly adopted by government agencies and law enforcement, it raises the question of what such inaccuracies may lead to.
Let’s take this a bit further…
Imagine you are an engineer in charge of training a new autonomous vehicle. It’s not hard to imagine you’d train it in more affluent areas, which are usually dominated by white people. In such a case, the vehicle presumably wouldn’t be as good at detecting non-white pedestrians. I’m not trying to point fingers or make wild guesses – the truth is that it’s human to make mistakes, and to teach the computer to make them as well.
I’d like to focus now on our responsibilities as practitioners in the field and the challenges we’re facing, even when doing our work with the best of intentions.
Doing our best, yet still introducing errors
Let’s consider Netflix again. Based on their technical blog, it’s easy to conclude there was no super-human artificial intelligence at work here. Rather, it seems to have been a manual process: videos in the content library were tagged with a set of labels someone thought would be meaningful for optimization. We don’t know whether those labels included something like “Has Black Actors,” or what exactly was on that list, but a team of graphic designers was then tasked with creating different variations of movie posters to match these labels. This enabled the algorithm to match people’s past viewing habits (represented as labels) with their tendency to click on a specific artwork variation.
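To make that workflow concrete, here’s a minimal, hypothetical sketch of a label-to-artwork matching loop, in the spirit of the Contextual Bandits mentioned earlier. The variant names, the “romcom_fan” label, and the deterministic click behavior are all invented for illustration – this is not Netflix’s actual system:

```python
import random

random.seed(0)

# Hypothetical artwork variants for a single title.
VARIANTS = ["poster_a", "poster_b", "poster_c"]

clicks = {}  # clicks[label][variant]
views = {}   # views[label][variant]

def choose_variant(label, epsilon=0.1):
    """Epsilon-greedy: mostly exploit the best-known variant for this
    viewer label, occasionally explore a random one."""
    stats = clicks.setdefault(label, {v: 0 for v in VARIANTS})
    seen = views.setdefault(label, {v: 1 for v in VARIANTS})  # 1 avoids div-by-zero
    if random.random() < epsilon:
        return random.choice(VARIANTS)
    return max(VARIANTS, key=lambda v: stats[v] / seen[v])

def record(label, variant, clicked):
    views[label][variant] += 1
    if clicked:
        clicks[label][variant] += 1

# Simplified simulation: viewers labeled "romcom_fan" only ever click
# "poster_b" (a deterministic stand-in for a higher click-through rate).
for _ in range(2000):
    v = choose_variant("romcom_fan")
    record("romcom_fan", v, clicked=(v == "poster_b"))

best = max(VARIANTS, key=lambda v: clicks["romcom_fan"][v] / views["romcom_fan"][v])
print(best)  # converges on "poster_b"
```

Notice that nothing here knows anything about the viewer beyond the label – which is precisely why the ethical weight falls on whoever designs the labels and the artwork variants.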
Does that match Netflix’s response?
Technically, Netflix is correct in stating it uses viewing preferences rather than trying to determine the race of the users themselves, and I don’t attribute any ill will to them. However, I do feel their response left out a slightly inconvenient fact, which is that intelligent design was still at work here (please pardon the pun). Similarly, as it pertains to Amazon’s gender/race detection, we do not know exactly what led to darker-skinned females being misclassified at a higher rate. There’s a good chance a selection bias was unknown to the algorithm’s creators, perhaps made worse by other technical difficulties.
One thing is clear to me, though. Whenever we’re charged with creating such models and deciding what input to feed them, we have a responsibility to take bias seriously. It’s not just about getting valid statistical or business results, but also about keeping ethics in mind. The general public may have the notion of AI as a black box emitting unforeseen outputs, and admittedly, there is probably a growing amount of truth to it. However, we still affect the outputs of the AI to a large degree, and if we find we’ve unintentionally made errors, we should clearly explain what went wrong and how we’re going to fix it, regardless of how inconvenient that may be.
Bias and misconceptions are older than the current AI hype cycle, and much has been written about core issues with the infamous p-value calculation, which is the basis for classic A/B Testing. (See the term p-hacking and Evan Miller’s writeup, “How Not To Run an A/B Test.”)
The problem doesn’t even lie squarely in the math itself, but in how easy and tempting it is to misuse and misinterpret it. One alternative approach is switching to Bayesian A/B Testing, and at Dynamic Yield, we’ve done just that. We believe this method provides results that are simply better suited to human interpretation, and over time, we have added similar protections throughout our product. Some of the potential pitfalls we found ahead of time; often, we had to learn things the hard way. And as we now move from A/B Testing towards Deep Learning-based personalization, we again face issues, this time on a whole new scale.
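For the curious, the core of a Bayesian A/B test can be sketched in a few lines: model each variant’s conversion rate with a Beta posterior and estimate the probability that one beats the other via Monte Carlo sampling. The uniform priors, traffic numbers, and conversion counts below are made up for illustration, and real implementations add plenty of safeguards on top:

```python
import random

random.seed(7)

def prob_b_beats_a(clicks_a, views_a, clicks_b, views_b, samples=100_000):
    """Monte Carlo estimate of P(variant B's true conversion rate > A's),
    using a Beta(1, 1) uniform prior on each rate."""
    wins = 0
    for _ in range(samples):
        rate_a = random.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        rate_b = random.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples

# 1,000 visitors each: A converted 50 (5.0%), B converted 65 (6.5%).
p = prob_b_beats_a(50, 1000, 65, 1000)
print(f"P(B > A) \u2248 {p:.2f}")
```

A statement like “there’s an N% chance B beats A” is, in our experience, far easier for humans to act on correctly than a p-value.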
Tackling bias in a less interpretable age
Deep neural networks, by design, make it significantly harder to detect which specific input features had the most influence on the output. Unlike yesterday’s machine learning methods, such as Random Forests or Logistic Regression, extracting the significant features is now a much bigger challenge. As Idan Michaeli, our former Chief Data Scientist, once said, “Deep understanding and predictive ability are decoupling as we speak.” This is no small thing, and arguably, it is at odds with the traditional way of making scientific progress.
Methods for gaining interpretability into Deep Learning models do exist. While I am not personally qualified to comment on them, I do see a major opposing trend quickly gaining momentum: Transfer Learning, which offers great benefits yet pushes us once again towards less interpretability.
To explain the term, let’s assume I want a computer to distinguish between a few different types of trees. If no model exists that does exactly that, it would be very hard to obtain the few hundred, if not thousands, of images needed for each tree type. Under these conditions, crafting my own model from scratch would no doubt be resource-intensive and may not yield great accuracy.
Instead, what if the visual features were extracted by an existing, veteran model that had already been trained on millions of images? My images can be fed into that existing model, and the features it outputs transferred into a new model that’s very simple and fast. Even though the original model had no labels for these tree types, it’s far better at recognizing patterns than a young, ignorant network would be. Researchers have demonstrated this method can often achieve similar or better results than fresh deep models, at a fraction of the time and cost. You might say it’s the algorithmic equivalent of standing on the shoulders of giants.
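Here’s a toy, structural sketch of that idea: a frozen “pretrained” feature extractor feeding a small trainable head. In a real setup the extractor would be, say, the penultimate layer of an ImageNet-trained CNN; the random projection and synthetic two-class data below are stand-ins chosen purely so the example is self-contained:

```python
import math
import random

random.seed(1)

# Stand-in for a large pretrained network: a *frozen* feature extractor.
# Here it's just a fixed random projection followed by a ReLU.
DIM_RAW, DIM_FEAT = 20, 5
W_FROZEN = [[random.gauss(0, 1) for _ in range(DIM_RAW)] for _ in range(DIM_FEAT)]

def extract_features(x):
    """'Pretrained' layer: frozen weights, never updated during training."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) / DIM_RAW)
            for row in W_FROZEN]

# The new, simple model: a logistic-regression head on top of the features.
weights = [0.0] * DIM_FEAT
bias = 0.0

def predict(feats):
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 / (1 + math.exp(-z))

# Synthetic "tree type" data: two classes with shifted raw inputs.
data = []
for _ in range(200):
    label = random.randint(0, 1)
    x = [random.gauss(2.0 if label else -2.0, 1.0) for _ in range(DIM_RAW)]
    data.append((x, label))

# Train only the small head -- the expensive extractor stays frozen.
lr = 0.05
for _ in range(50):
    for x, y in data:
        feats = extract_features(x)
        err = predict(feats) - y  # gradient of the logistic loss w.r.t. z
        weights = [w - lr * err * f for w, f in zip(weights, feats)]
        bias -= lr * err

correct = sum((predict(extract_features(x)) > 0.5) == bool(y) for x, y in data)
print(f"train accuracy: {correct / len(data):.0%}")
```

The point of the structure is the asymmetry: all the learning happens in a handful of head parameters, while the representation itself arrives opaque and pre-baked – which is exactly the interpretability problem discussed next.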
Now, imagine transfer learning evolves into a huge marketplace. In this market, we can buy features for our images (or whatnot) from vast pre-trained commercial deep networks. Companies with deep pockets would build them while the rest of us could easily use the outputs for a small cost, feeding them into our much simpler models. In such a setting, we’re back to square one on interpretability, as we may not know what inputs contributed to these features and whether the supplier has acted with ethical due diligence.
AI has already taken over more of the decision-making previously reserved for humans, and this process will most likely accelerate. Many are still in the dark as to what exactly is going on here, the level of insight human operators actually have, and whether we can trust the designers of future algorithms to do their best to eliminate bias. However hard we try, new and fresh mistakes are just waiting to be made, and I certainly hope Netflix, Amazon, and others can at least admit and act on their errors without hiding behind the fog of AI hype. I believe some amount of humility will go a long way towards tackling these concerns, and we’re certainly following that spirit at Dynamic Yield.