Bring order to the wild west: Start testing machine learning models

Machine Learning (ML) is a revolutionary tool that many organizations are working tirelessly to adopt into their businesses. In these initial forays into ML over the past few years, we’ve developed techniques that vastly outclass algorithms that took decades to develop. The potential seems unlimited, but at the same time we’ve barely scratched the surface of what will be possible in the coming years.

A challenge that most engineering departments face when adopting machine learning, however, is one of trust: trust that the model is indeed doing what it should, and that it can consistently provide the outcomes expected of it.

Machine learning brings new challenges

When coding classical algorithms, the logic is laid out precisely and can be interpreted by any programmer with sufficient knowledge of the language and the business domain. One can trace through the algorithm and at each branch, follow the decisions that the algorithm has taken to produce the final output.

With ML, however, there is no code per se, and the model cannot easily be interpreted to trace a decision. One has to trust that the model has sufficiently “understood” the use case through the training data to properly interpret the inputs and generate the correct outputs.

« 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms or the teams responsible. »

– Gartner

Since a human cannot explicitly validate the inner workings of the model, confidence suffers. The problem is so widespread that, according to Gartner, through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms or the teams responsible.

This is a significant issue that merits utilizing standard, commonly approved techniques and approaches to minimize risk and maximize success of projects.

The new kind of “Wild West”

Much like software engineering in the 90s, machine learning is a young, burgeoning field, filled with the chaos and “wild west” mentality so common to emerging technologies.

With this chaos come many varying approaches and techniques: each data scientist uses their own approach when developing models, without much of a defined engineering practice to guide development and ensure a minimum level of quality.

Although there exist various techniques and approaches to measure the quality of ML models, many of them are not widely adopted and most are altogether unknown to the majority of the community.

Accuracy for the win… or not.

The most common metric used by data scientists is accuracy: the proportion of correct predictions the model produces on a defined test dataset.

While accuracy is a useful metric, it’s only part of the picture. It is limited to a specific dataset, a snapshot of a single moment in time. Accuracy can also be distorted by various human errors, such as labelling errors or data leakage, as well as training errors, such as overfitting.
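To make this concrete, here is a minimal sketch (the dataset and the scikit-learn calls are illustrative assumptions, not part of the article) of how a trivial majority-class predictor can still post an impressive accuracy score on an imbalanced test set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical test labels: 95% of samples belong to class 0.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its inputs and always predicts the majority class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks great on paper
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- no better than chance
```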

Accuracy as a single metric of the quality of a model is thus woefully limited. Add to that the fact that machine learning models tend to be black boxes that are incredibly hard to interpret, and it becomes increasingly difficult to pinpoint exact shortcomings in a model.

In short, it’s very difficult to identify why things went wrong when they do.

What is model robustness and why should you care?

Over the past few years, there has been a growing movement to increase model robustness and interpretability.

Model robustness is defined by Investopedia as follows: a model is considered to be robust if its output dependent variable (label) is consistently accurate even if one or more of the input independent variables (features) or assumptions are drastically changed due to unforeseen circumstances.

In essence: how well will the model fare when confronted with real-world data, which can differ from the training data and evolve over time?

This can include changes in the shape of the data due to evolving real-world circumstances, or even malicious intent from a model consumer, such as someone trying to “hack” or “game” the model.
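One simple way to probe this kind of robustness, sketched below under the assumption of a scikit-learn-style classifier with a `predict` method (the helper function, noise level and trial count are all illustrative), is to perturb the inputs slightly and measure how often the predictions change:

```python
import numpy as np

def prediction_stability(model, X, noise_scale=0.01, n_trials=20, seed=0):
    """Fraction of predictions that stay unchanged under small input noise."""
    rng = np.random.default_rng(seed)
    baseline = model.predict(X)
    stable = np.zeros(len(X))
    for _ in range(n_trials):
        X_noisy = X + rng.normal(scale=noise_scale, size=X.shape)
        stable += model.predict(X_noisy) == baseline
    return (stable / n_trials).mean()

# Hypothetical usage with any fitted scikit-learn-style classifier
# and a numeric feature matrix X_test:
# score = prediction_stability(fitted_model, X_test)
# print(f"{score:.1%} of predictions are stable under small input noise")
```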

Interpretability to manage biases

As for interpretability, understanding how each independent variable contributes to the overall prediction is useful in many cases. It matters both for verifying that the model is predicting as it should (with a human double-checking the prediction) and for determining whether the underlying training data, and therefore the model, has any strong biases.

Biases in themselves are not necessarily undesirable, as they simply indicate a strong correlation between an independent variable and the output. However, a bias is a problem if it runs along ethnic, political or gender lines, which should in theory have no impact on the model’s output.

If the training data was generated by human activity, it may contain the implicit biases of those humans, and the model will simply learn to repeat them. It is therefore important to understand how these biases impact the model in order to make an informed decision on whether or not to use it for business operations.
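As a rough illustration of how such a bias can be surfaced, the sketch below uses permutation importance on a synthetic dataset in which a sensitive attribute is deliberately correlated with the label (the data, model and feature names are all assumptions made for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
sensitive = rng.integers(0, 2, size=n)   # e.g. a gender-like attribute
legitimate = rng.normal(size=n)          # a feature that should drive predictions
noise = rng.normal(size=n)               # a feature that should not matter

# The label is built to depend heavily on the sensitive attribute -- a deliberate bias.
y = (2.0 * sensitive + legitimate + 0.1 * rng.normal(size=n) > 1.0).astype(int)
X = np.column_stack([sensitive, legitimate, noise])

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, importance in zip(["sensitive", "legitimate", "noise"],
                            result.importances_mean):
    print(f"{name:>10s}: {importance:.3f}")  # the sensitive feature dominates
```

A large importance score on a feature that should, in theory, be irrelevant is a signal to investigate the training data before putting the model into production.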

Model robustness: a standard to generate trust

The overall challenge of ML model robustness is gaining traction as standards bodies, such as ISO SC42, currently have working groups to address this concern. Through collaboration with members of the ISO consortium, the “AI trustworthiness” working group aims to produce guidelines that can be used to address and alleviate the issues that I’ve laid out in the previous paragraphs.

While it may take time for an official standard to emerge, one thing is certain: this issue is critical to generating trust in machine learning models as well as the long-term viability of ML as a tool to automate business decisions.

Getting started with model robustness

Until then, what are your options for ensuring machine learning model robustness?

A good starting point is to create a standard process for your machine learning development pipeline. Ensure that your data scientists understand it well and adhere to it. You can then start integrating various quality-assurance elements into the pipeline.

Peer reviews are a great way to catch easily overlooked mistakes that can have large repercussions and have the added bonus of spreading experience and domain knowledge. Various robustness tests can also be added to your pipeline, such as tests for biases, sensitivity to noise, data drift and others.
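As an example of one such test, here is a minimal sketch of a data drift check that compares each feature’s distribution in production against the training data with a two-sample Kolmogorov–Smirnov test (the data, feature names and significance threshold are illustrative assumptions, not a prescribed approach):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(X_train, X_prod, feature_names, alpha=0.01):
    """Flag features whose production distribution differs from training."""
    drifted = []
    for i, name in enumerate(feature_names):
        stat, p_value = ks_2samp(X_train[:, i], X_prod[:, i])
        if p_value < alpha:
            drifted.append((name, stat, p_value))
    return drifted

# Synthetic example: the second feature's mean shifts in production.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))
X_prod = np.column_stack([rng.normal(size=500), rng.normal(loc=0.5, size=500)])

for name, stat, p in detect_drift(X_train, X_prod, ["feature_a", "feature_b"]):
    print(f"drift detected in {name}: KS={stat:.3f}, p={p:.2g}")
```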

You should also define a monitoring policy for models that are deployed to production and periodically review them to ensure that they are still operating as expected.

There are also a number of tools and products that can be used to further assist you by automating many of these approaches, freeing time for your data scientists to focus on what they are best at: creating models that generate value for your organization.

By leveraging these tools and products, you can ensure that your quality assurance stays relevant and evolves in step with the science of model robustness.

At the end of the day, it all boils down to one thing: how confident are you that your ML models are doing what they are supposed to do? How do you ensure that your organization delivers robust and high-quality AI?

Christian Mérat

Christian is Snitch AI's Chief Operating Officer. He's really good at turning a product’s vision into reality. He can overcome any technical obstacle using his innate ability to leverage the right technology to meet business needs. In the last decade, he strongly contributed to the commercial success of Sharegate and Officevibe thanks to his strategic vision and cutting-edge technological expertise.
