Free on-demand webinar
So Gartner says that most of the AI projects we’re currently doing will deliver erroneous outcomes. Now what?
From a chatbot that turned plain racist to a CV-filtering solution that systematically discriminated by gender, we’re starting to see more and more case studies about AI that goes wrong…
…and Gartner is saying that this is only the tip of the iceberg.
Why do your projects have a high chance of failing? And more importantly: what can you do TODAY to prevent it?
Model accuracy is not the only metric you should care about.
In this webinar, Olivier Blais, AI/ML Quality Evangelist here at Snitch AI, will guide you through the reasons why ML/AI models fail and the best practices for validating your models.
For those of you who want to see the Gartner source, there you go.
Below you’ll find the webinar transcript. We’ve obviously used AI to generate it, so maybe you’ll see some incoherence. Let us know if you do! 🙂
All right, welcome, everyone. Thanks for joining us today on our first Snitch AI webinar. We’re pretty excited. My name is Simon Shanks, I’m a product marketing manager here at Snitch AI.
We’ve got some really good content for you. I know you worked hard on this presentation. We’re just going to wait a few seconds while people trickle in. A few things on today’s webinar: it is recorded, so we will be sending out the recording later on. You can rewatch it, you can share it with your colleagues, and those who couldn’t make it today can watch it in the future.
Feel free to send us any questions by email after watching the recording. For those who made it today, we will have a live Q&A at the end of the presentation. At the bottom of your Zoom window there’s a Q&A section, so feel free to shoot in questions as they come to you during the presentation. We will only be answering them at the end, but we will stay for as long as you have questions, so feel free to ask as many as you want. Perfect, people seem to have trickled in, so let’s get started. The floor is yours.
Gartner says your ML models will fail.
Good. Hello, everybody. Thanks for being here. Here is a quick self-presentation: my name is Olivier Blais, I’m head of data and transformation at Moov AI, a consulting company in data science, and I’m also CPO at Snitch AI.
I work with the Snitch team to develop an AI quality evaluation tool. I’m also currently working on ISO standards through the Standards Council of Canada on quality evaluation; we’re working on making a standards project on quality evaluation available. And I’m a guest speaker, talking about AI and what it can do in business contexts.
I’ve been working on several different projects in my other life, over the last eight years. Excellent. So today, there’s a question we need to ask ourselves: what’s happening with the models that we’re currently building? Artificial intelligence is something we’re starting to get familiar with. Two years ago, we were talking a lot about AI not being used: you do a project, and then you never deploy it. But the next step these days is something even more drastic. Personally, I was shocked to hear that most projects…
…well, were not put in production. But nowadays it’s even worse, because Gartner estimates that 85% of AI projects will deliver erroneous outcomes. In my mind, this is even worse, because it’s better to not have a tool than to have a tool that returns errors or inaccurate predictions. What can we do to prevent this? This is essentially what we’re going to talk about. It’s something that’s really important to me, and I hope it’s also important to you, so you can avoid making these mistakes in your future projects.
So the objectives are a little bit like what was written in the invitation, and we’ll stick to those, because they are our three valid objectives. First, we’ll talk a little bit about validation of your model:
Why is it important? We’ll touch on that a little bit, and I hope it’s going to resonate. Then we’ll go a little bit deeper into the quality evaluation process.
What should we expect? What should a quality evaluation process look like?
How do you implement such a process in your team? I’ll try not to be too technical, but by nature I’m a geek, so it might get a little bit deep. If you have any questions, please feel free to ask later on, either during the question period or by email.
Why is my machine learning not performing well?
So usually, when I talk about machine learning, I refer to this picture. It’s the previous picture, but a more accurate one.
Often in an organization you’ll hear the data science team say: you know, big data, great model, my model is performing so well. Even early in a project you’ve just started, the team already says: you know what, I’m able to solve this problem, I’m able to make a correct prediction. And then, during the project lifecycle, or after the project is delivered, you realize that it didn’t really work. It’s like the myth of the fisherman.
You think you have a big fish, but it’s a shoe, or it’s a tire. It doesn’t work. And usually, it’s because it’s a complex project. I’m not talking against the data science team: it’s because the problem we’re trying to solve is really complex, and it’s easy to be misled by some of the problems that can happen during the project.
Why does performance matter?
So let’s come back a little bit and talk about performance: why does it matter? Here I’m going to stay a little more high level. Why do we need to make sure that we’re delivering performant projects, projects of good quality? First of all, it’s a little bit like an onion: it has several different benefits for several different stakeholders.
At first, the project team is definitely going to benefit from a quality project. It means a streamlined execution: if you’re able to deliver quality, you don’t have to do a lot of redos, and you don’t have to rush into fixing things after the project is delivered. You make sure that you have a good project delivered, as per the requirements.
For instance, I’m head of data and transformation, so I have a lot of different projects ongoing, and I want to make sure that every one of my projects is as good in terms of quality. Having a good process, the things we’re going to talk about today, allows me to compare my projects apples to apples, even if the use case is different and the task is different. One might be an image recognition project.
Another one might be NLP, or tabular data; it doesn’t really matter. Because if you’re able to quantify what’s good quality and what’s not, then you’re able to compare, and you’re able to create thresholds that hold regardless of the project type.
Then, let’s go outside that pocket. Like I said before, I’m a geek, so I know that things might go wrong from time to time, and I’m able to troubleshoot a project because I’m experienced. But what happens when we go outside this technical pocket, to the stakeholders, the end users of the project, the sponsors of the project? They need to feel confident about the project. And here is another example why.
When you have those fisherman myths, you lose confidence so fast, because it’s hard to trust the team. If you say that something is good, and then they’re able to point out a very simple mistake, something that’s really clear to them, then the data science team will lose a lot of credibility.
And you clearly don’t want that. Then finally, let’s go to the highest level possible. This is a little bit why I decided to participate in the ISO standard: because we need to think about society, about the end users. An AI system deployed well will generate benefits, and you want to make sure that whatever is deployed is good enough that it does good: by being precise, by achieving the business case, by being able to achieve the benefits that you’re looking for.
And this is important for society, because as we gain adoption we’re helping democratize the technology, and that’s great because it creates a virtuous cycle. So it’s really important to make sure that we have this type of process in place. But a lot can go wrong.
What can go wrong with my machine learning model?
So here, I’ll put everything into perspective by going back and talking about what a machine learning model is and how the system is built. This is going to be 101, really basic, just to make sure that everybody’s on the same page.
A system starts with data; here we’ll talk about historical data. From this historical data, a machine learning algorithm is going to learn trends. Let’s keep it as simple as possible: it learns trends in the data that has been observed. We then validate the model performance, and the model performance is essentially: how well can these trends be transposed into a prediction on new data?
If you’re able to predict very well, it means that your trends are very precise, and based on these trends you’re able to predict new data points. Once you have this model in place, the way you operationalize it is by adding new data: you apply the trends in the model that has been trained, and based on those trends you make predictions. It sounds very simple.
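The train-validate-predict loop described above can be sketched in a few lines. This is a deliberately trivial illustration, not a real training procedure: the "model" learns a single threshold, and the data and names are invented for the example.

```python
# Minimal sketch of the cycle above: learn trends from historical data,
# then apply those trends to predict on new, unseen data points.

def train(historical):
    """Learn a 'trend' (here, a single threshold) from labeled data."""
    pos = [x for x, y in historical if y == 1]
    neg = [x for x, y in historical if y == 0]
    # Midpoint between the two class means acts as the decision boundary.
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    """Operationalize the model: apply the learned trend to new data."""
    return 1 if x >= threshold else 0

historical = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
threshold = train(historical)
print(predict(threshold, 7.5))   # -> 1
```

A real pipeline would of course hold out a test set to check how well the learned trends transfer to new data, which is exactly the model-performance validation described above.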
Bad training data
But a common problem is when you have bad training data. Garbage in, garbage out is a classic that we see in data science, so you need to start with good data. Another problem is irrelevant model trends. For instance, and I don’t know if you have heard of overfitting: overfitting is the equivalent of learning something by heart.
You try to learn every word so precisely that you forget the overall message you’re trying to learn.
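That "learning by heart" failure mode can be made concrete with a toy sketch (the data and the y = 2x trend are invented for the illustration): a model that memorizes its training pairs is perfect on seen data and useless on new data.

```python
# Overfitting as memorization: perfect on training data, broken on new data.

train_data = {1: 2, 2: 4, 3: 6, 4: 8}   # underlying trend is y = 2x

def memorizer(x):
    """Overfit 'model': pure lookup of seen examples, no trend learned."""
    return train_data.get(x, 0)          # falls back to a blind guess

def generalizer(x):
    """Model that captured the underlying trend instead."""
    return 2 * x

train_err = sum(abs(memorizer(x) - y) for x, y in train_data.items())
new_err_memo = abs(memorizer(10) - 20)   # unseen input: memorizer fails
new_err_gen = abs(generalizer(10) - 20)  # trend-based model still works
print(train_err, new_err_memo, new_err_gen)   # -> 0 20 0
```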
Incorrect performance assessment
Exactly the same thing can happen to a machine learning model: the trends might not be generalizable to new data, because they’re irrelevant. If a business person or a subject matter expert looks into those trends, they’ll understand very quickly that this won’t work. Also, you might think that you have a great solution to a problem, but if you incorrectly assess your solution, it won’t work.
An example could be machine learning on something called unbalanced data. For instance, in cancer prediction, you typically have a tiny ratio of people who have cancer, and a majority of people who don’t. And that’s great, by the way; we want to have this type of imbalance when we talk about cancer.
But the problem is if you think your model is 99.9% accurate just because it predicts that nobody has cancer. Your performance has been assessed poorly, and it doesn’t work.
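The accuracy trap just described is easy to reproduce. In this sketch (the numbers are illustrative), a "model" that always predicts "no cancer" scores 99.9% accuracy while catching zero actual cases; recall exposes it immediately.

```python
# Imbalanced data: accuracy alone is a misleading performance assessment.

y_true = [1] + [0] * 999        # 1 cancer case out of 1000 patients
y_pred = [0] * 1000             # model always predicts "no cancer"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)   # share of real cancer cases detected

print(accuracy)   # -> 0.999, looks excellent
print(recall)     # -> 0.0, the model is useless
```

This is why unbalanced problems call for metrics such as recall, precision, or AUC alongside accuracy.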
When we look at new data, the new data might be unstable, might be different. Here we often talk about out-of-sample data: maybe trends shifted, maybe there are some technical errors, and it generates data that is just not similar to what the model has been trained on. The gap is too wide.
And because of that, we’re not able to predict correctly. On the business side, the subject matter experts or the end users might not be able to understand correctly what’s going on, and issues can hide in there; we never know, if we’re not able to understand which variables most explain a prediction.
And then finally, even if your model is very good and the data is good, if you have miscalibrated your predictions, you’re not able to act properly on a prediction, and the benefit that you think you’re achieving might just not be there anymore. This is why it’s also important to understand correctly what a good prediction is and what a bad prediction is. I know that’s a lot, but it’s just the tip of the iceberg; there’s so much that can go wrong in a machine learning project.
You need to control the quality evaluation process
So when I say that we need to control the quality evaluation process, I mean we need a tight process in place, and we need to make sure that this process reflects the quality that you’re trying to achieve. I’m going to explain what that means. When we talk about the process, it usually comes down to these four types of activities that you’ll need to put in place.
Identify required evaluation activities
First, you need to understand what the major evaluation activities are, because in the project lifecycle, the quality evaluation will not be continuous per se. It’s not something that runs every hour, like something in production. You need to plan time in your project to validate the quality, to avoid accumulating bad quality for a long time and then having to go back and change everything about the project. So you need different activities at different times.
Determine measures & measurement methods
You need to determine measures and measurement methods. This is pretty hard, and we’ll talk about that later; I’ll guide you through some of the measures and measurements. We also have another webinar in three weeks, on the 16th, where we’ll go further in this direction. So rest assured that we’ll talk about this a little bit later.
Ensure comprehension of output by stakeholders
You need to ensure comprehension of the output, because the stakeholders will want to be reassured, will want to gain confidence. It cannot just be a JSON with results, with only the data science team able to decode everything and troubleshoot using the output of a quality evaluation tool or process.
It needs to be reflected back to the stakeholders so that they’re able to understand that everything went through, that it’s a good project, and that we can go forward and deploy it. And finally, once you have the right process, you can standardize it and embed it into your operations.
Establish a standard quality evaluation process
So what type of activities will we need to do? Let’s think about the execution phases. A project will usually start with the project kickoff, and by all means, depending on your company, the names of the activities might change a little bit, but I think the phases are pretty standard. We start with a kickoff: we understand the vision, we understand the problem that we’re trying to solve.
At that stage, we’ll already need to start evaluating data bias. One of the first activities is to look at the data that is available and ask ourselves whether the data is good enough to do anything with. If you’re not able to do something with this data, because it’s too biased or too messy, we need to raise the flag.
There are also other risks that we’ll need to evaluate at that stage: the problem can be too complex to predict, it can be too sensitive, and it might be very hard to adapt. These are different types of risks that we’ll need to highlight very early on. Then, when we have identified what the project looks like, it falls into proof-of-concept mode, so we’ll start doing a PoC.
Our goal will be to demonstrate that it works, that we can do something about it, that we can resolve the situation. So here, what we’ll need to do is something we call functional testing, which is being able to evaluate the suitability of the project. I’ll talk about it a little bit later.
Suitability is essentially: is AI necessary? Can I gain anything by using a complex model, or can I use basic statistics and get the same outcome? This type of testing is also necessary at this stage, because if we decide to go further, it means that the approach that has been accepted is generating additional benefits, and this is what’s going to be implemented. So then we’re in implementation mode.
What it means is: OK, we learned from the proof of concept, now we’re making it available. We go from a simple end-to-end model to deploying an AI system from A to Z. And when we do this, this is where we need a full, overall quality evaluation, and it needs to gate the deployment. This is a best practice even if you say, you know what, we’ll just do a pilot: as soon as you think about making the outcome available to anybody, you need to have everything validated thoroughly, because you will lose credibility otherwise.
Then, when it’s deployed, we fall into continuous improvement mode, and we need to monitor the performance. There are two things we’ll need to monitor. Is there data drift: is the data evolving too much, are there new trends, meaning that we need to retrain the model? And is the model degrading: if the model is not able to make good predictions anymore, we’ll also need to do something about it. Excellent.
So let’s look at any one of those activities. What we need at that stage is a model and data, or, if you’re just validating the data bias, for instance, you might just need a data set, plus the best practices that we’ll talk about later on, and also in three weeks. This feeds the quality evaluation process that we’re talking about today, and it generates an answer to the critical question: is the quality acceptable?
If the quality is acceptable, that’s the happy scenario: you did something right, and you’re able to push it into production, to deploy your model. Otherwise, if it’s not acceptable, you need to understand why. And by the way, this portion is normal; you’ll always have something to fix, it’s part of the normal process. If the problem is hard, you will have to improve quality a couple of times before you can deploy. And that’s good news: it means you prevented the catastrophe you would have had if you had deployed the model. And it creates a lot of learnings in your organization.
It’s important to document those learnings, to document the root cause leading to a quality issue, so that you can try to prevent it proactively next time. Then you fix it, so you’re improving quality, and that triggers another quality evaluation pass. This is how it should be done. I know it can be painful, but if you want to deliver quality, this is what you need to do.
How to determine machine learning evaluation measures?
Now let’s talk a little bit about measurement, because this is not easy. When we talk about measuring quality, we need to think about the definition of quality, and this is very hard, because an ML system is not just software per se; it’s a very complex solution. So here are four different characteristics that can be translated into quality.
I’ll explain these a little bit. This is just a portion of the different quality dimensions; we’re able to extract a lot more dimensions of a quality system, but these, in my mind, are the most critical ones to assess.
What is suitability in machine learning?
First of all, suitability: is your project a good project? Are you generating the benefits that you’re looking for? This is a very legitimate question that you can ask yourself here.
Reliability of an artificial intelligence system
Reliability… even if you have a good project, a good system, if your system doesn’t work on different data, if it doesn’t generalize very well, or if the performance drops every time there’s just a little change in the data set, then it’s not very reliable, and it’s going to be a challenge to get the benefits that you’re looking for. Here, you need to make sure that your system is robust.
Discrimination in machine learning
This is another very important dimension: discrimination.
By the way, when I use the word discrimination, we need to understand that in this sense, discrimination is normal. It’s essentially why a machine learning model works: there are some variables that you can use to discriminate towards a certain goal. So it’s normal to have discrimination in the project. But you need to understand the type of discrimination; you need to understand what’s inside the black box.
Because you don’t want to fall into the ethical risks and the harmful discrimination that can happen. It can happen in a lot of different settings. It happened in banks when they started to look at the zip code, and some zip codes were a proxy for the ethnicity of the person. You don’t have a column in your data that says ethnicity, but machine learning is so powerful that it’s able to make these deductions. Even if it’s not explicit, it’s harmful for the population, and you need to be able to see this and act upon it.
Maintainability of an artificial intelligence system
And then finally, maintainability. It’s a little bit what we talked about before: being able to monitor your model and your system after it has been deployed. And when it’s monitored, making sure that you have a project that you’re able to maintain well, so that when the data is shifting, you’re able to catch it and retrain the model, so that it’s always refreshed and always up to date, without necessarily having to intervene manually. And also making sure that the performance is not just dropping without any human intervention, or any intervention at all.
If something changes, something needs to happen. Machine learning is not as stable as, for instance, a software solution, and even a software solution needs maintenance: if you have a website and there are new versions of browsers, you need to validate against those new versions.
This is even more critical for a data science project, because the data evolves at an alarming rate, and you need to act on it. It’s not easy to see those shifts, so you need the proper tools to get there. So let’s go back and talk about suitability. I’ll give you an example: when you’re trying to put a proof of concept together, how do you validate that?
How to measure suitability of a machine learning or artificial intelligence system?
Is this the right use case for machine learning, for artificial intelligence? First, you want to do some tests and see the performance based on the metrics. And for your information, I’m using the plural, because we should never have just one metric; I’ll touch on this in two minutes. It’s a set of metrics that you need to look at to validate the quality of your model.
First of all, you need to ask yourself: OK, if I just randomly select predictions, yes, no, yes, no, what are my odds? A good example is heads or tails: flip a coin, and you have a 50% chance of getting heads and a 50% chance of getting tails.
Even with the most sophisticated project, you can end up with the same odds, or even worse. Believe it or not, it’s possible to be worse than a random state: it happens when you’re trying so hard to learn something that you cannot, because it’s completely random. So this is your first sanity check: am I better than a random state? This is the lowest bar ever.
After that, the right next step is to test against a dummy model. The dummy model could be yesterday’s value, if you’re trying to forecast the future; it could be the median, or the average, and so on. There are different statistics that you can use: say, I’m always going to use the average of the last week to predict something. It might bring good performance, and it’s so simple to implement. This is something you should not skip.
And actually, the dummy model is the minimum threshold. If you’re able to beat the minimum threshold, then you see that, OK, using machine learning is making a difference; it’s helping me get where I need to go. But even then, there are a lot of different models. You can have a simple model, or you can have the most complex model, and there’s no limit; we see it with GPT-3 and BERT, there is no limit in terms of complexity.
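As a sketch of that baseline check (the series and both dummy strategies are illustrative), here is how a dummy-model threshold can be computed before any real model is trained:

```python
# Dummy-model baselines: the minimum threshold a real model must beat.

series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]

# Dummy 1: predict each point as yesterday's value.
yesterday_mae = sum(abs(series[i] - series[i - 1])
                    for i in range(1, len(series))) / (len(series) - 1)

# Dummy 2: predict each point as the historical mean.
mean = sum(series) / len(series)
mean_mae = sum(abs(series[i] - mean)
               for i in range(1, len(series))) / (len(series) - 1)

# Whichever dummy scores best sets the bar for the machine learning model.
baseline = min(yesterday_mae, mean_mae)
print(round(yesterday_mae, 3), round(mean_mae, 3), round(baseline, 3))
# -> 1.556 1.222 1.222
```

If a trained model cannot beat that baseline error, the machine learning is not yet adding value.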
The only limits currently are the processing time and the data set. So when do we need to stop? We see it by looking at the returns. When you go step by step, what you’ll see, and this is not a rule, because it depends on your project, is that you will gain, and that’s the point.
As you move to more complex models, you’ll see that you plateau. It’s normal, because there’s a ceiling at some point, a theoretical ceiling: data will always be a little bit imperfect, and you will never be able to explain 100% of it. This is why you’ll never achieve absolute perfection, but you’ll get closer to it. And then, when it’s too complex, performance will start decreasing.
And it can decrease a lot. I’m saying this because not only can you overfit or have bad performance metrics, but all the other metrics suffer too. Take robustness: the more complex the model, the less robust it will be, because you’ll have to maintain a very complex solution. And take inference time: a small model will be very snappy and quick to predict something, while a complex model will be heavy and will take time to make a prediction.
That can be a showstopper if you need your decision within a couple of seconds. You need to take that into consideration when you look at your performance.
How to measure reliability in machine learning?
If we talk about reliability: reliability is all about stress-testing your own model. If you have some data and you build a model, how robust is your model?
This is a good question, because if you only look at your training data set and you never look at the live data, you might be very surprised. You can add some random noise; systems change, and the data changes a little bit every day, and that creates problems. You might also face new scenarios.
I don’t want to dwell on what happened some months ago, but COVID-19 was catastrophic, also for machine learning: it made most machine learning models unusable. Why? Because they were not ready to deal with such scenarios; we had never experienced anything like it. I’m not saying models should have handled it, because this one was really drastic. But you could get closer, or reduce the performance drop, by creating new scenarios: you add noise to your model, even extreme noise. You could even think about attacking your own model; we’re starting to see this with adversarial examples.
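A noise-injection robustness check like the one described can be sketched as follows. The "model" here is a stand-in threshold classifier, and the noise levels are arbitrary choices for the illustration:

```python
# Robustness probe: perturb inputs with random noise and measure how often
# the model's predictions stay the same as on the clean inputs.
import random

def model(x):
    """Toy stand-in for a trained binary classifier."""
    return 1 if x >= 5.0 else 0

points = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]   # clean validation inputs
random.seed(42)

def stability(noise_scale, trials=200):
    """Fraction of noisy predictions matching the clean prediction."""
    same = 0
    for _ in range(trials):
        for x in points:
            same += model(x + random.gauss(0, noise_scale)) == model(x)
    return same / (trials * len(points))

small_noise = stability(0.5)   # mild perturbation: predictions barely move
large_noise = stability(5.0)   # extreme perturbation: predictions flip often
print(small_noise, large_noise)
```

A sharp stability drop under modest noise is the warning sign that the model will not survive contact with live data.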
Adversarial examples are a way people can hack your model, and you can test whether your model still performs against them. Finally, also look at the uncertainty: you might have good performance, but if you have a very high level of uncertainty as soon as you deploy, you might be in trouble later on. Now let’s talk about model discrimination for a second. This slide is quite dense, and you know what, I did it on purpose, because being able to see the model contribution, the feature contribution to the model…
…is never easy, and you need to find an easy way to do it. And it’s no longer true that you can’t do this; it’s no longer true that deep learning is a black box where you cannot explore the biggest drivers of your model. That’s not true anymore. There are some very useful, validated tools that you can use right away.
Okay, so this is an example of measuring discrimination. If we look at the first variable, it’s about chest pain. What it says is that whether you have a high value or a low value of that variable really matters: it’s really discriminant. Every time there’s a high value in this first variable, the model tends to predict a zero; and when you have a low value in this variable, you tend towards a one.
So if it’s binary, you predict that something’s going to happen, or not. This variable is really discriminant, and it gives me a lot of information. For instance, if we go down to gender, it’s not as discriminant, because the values are near zero, near the middle line. But what it says is that when the patient is male, you tend to go towards zero.
So this male trend is helping the model get better. And here, asking a subject matter expert is a good thing: if the subject matter expert looks at every one of those variables and agrees, it’s a great sign. It means that your model has trends that seem to be useful.
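One validated, model-agnostic way to get this kind of feature-contribution view is permutation importance: shuffle one variable and measure how much the score drops. The model and data below are invented for the sketch; only the first feature, standing in for "chest pain", actually drives the label.

```python
# Permutation importance: a bigger score drop after shuffling a feature
# means the model discriminates more strongly on that feature.
import random

def model(row):
    """Toy classifier that only uses feature 0."""
    return 1 if row[0] >= 0.5 else 0

random.seed(1)
X = [[random.random(), random.random()] for _ in range(200)]
y = [model(row) for row in X]        # labels driven by feature 0 alone

def accuracy(rows):
    return sum(model(r) == t for r, t in zip(rows, y)) / len(y)

def permutation_importance(col):
    shuffled = [r[col] for r in X]
    random.shuffle(shuffled)
    X_perm = [r[:col] + [s] + r[col + 1:] for r, s in zip(X, shuffled)]
    return accuracy(X) - accuracy(X_perm)

imp_feature0 = permutation_importance(0)   # large drop: discriminant variable
imp_feature1 = permutation_importance(1)   # no drop: irrelevant variable
print(imp_feature0, imp_feature1)
```

Libraries such as SHAP give a richer per-observation view of the same idea, which is what plots like the one on this slide typically show.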
Then, when we talk about maintainability, here is a good quote from Towards Data Science.
This is true: you're trying to beat the clock with your model, because the world is evolving. A model is something that you train at one point in time, and then you predict with it, and then you retrain it. But you have always retrained on the past.
So you're always in the past. This is why it's so important to make sure that the trends you've learned are robust, and that you're able to maintain your model easily. It's not true that your model evolves in real time. It's true if you're talking about reinforcement learning, but even reinforcement learning is never case by case: it takes a lot of observations to adapt the predictions properly.
How to mitigate the inevitable data drift
So, when we talk about data drift: your model has been trained on clean data. It's like living at your parents' place, where you're used to a very simple life. Then when you're on your own, you realize: oh, I should have stayed at my parents', because there I was protected against all of nature's elements. Now you're by yourself.
So you see key changes in your systems: new trends, COVID-19, a financial crisis. It can be a shift that happens over time. When new technology arrives, like the iPhone or the iPad, it took some time for the trends in electronics to change, because it takes time until a critical mass gets access to the technology.
Or there are overnight shifts. In COVID, for instance, legislation can suddenly prevent you from going to a restaurant or a store, and it happens overnight. It completely changes the patterns, not only for the restaurant and the store: if you're an internet service provider, it changes the way people use your service, because everyone is at home.
When it happens overnight, you need to be able to act on it. And by the way, these are just a few examples. So now that we've talked about the process, which I hope you'll put in place, and about some of the dimensions you need to look into...
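A simple way to act on drift like the overnight shifts described above is to compare the distribution of a production feature against the training distribution. Here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the data is simulated and the significance threshold is an arbitrary choice, not a recommendation.

```python
# Minimal sketch of detecting data drift with a two-sample
# Kolmogorov-Smirnov test. Data and threshold are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=2000)

# Simulated production data after an "overnight" shift
# (e.g. a lockdown suddenly changes usage patterns).
production_feature = rng.normal(loc=0.8, scale=1.0, size=2000)

stat, p_value = ks_2samp(training_feature, production_feature)
drifted = p_value < 0.01  # the 0.01 cutoff is an arbitrary example
print(f"KS statistic={stat:.3f}, drift detected: {drifted}")
```

In practice you would run a check like this per feature on every new batch, and alert or retrain when it fires.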
You need to ensure stakeholders' comprehension of your machine learning model's output
Now, what happens next? First, you need to develop output that is useful to your stakeholders. Here is one of the best practices we've seen at Snitch AI.
First, it's very useful to create a global quality metric. It's a must-do, because it lets you compare two versions of a model against each other. But it's also important to be able to break down that metric into sub-categories.
So if you look at the image, let's say we have a 67. What 67 means depends on your own definitions, but here it could mean: something we suggest you put in prod but monitor, or review. And then the different sections show the other dimensions of your model's quality, each flagged red, yellow or green.
So even though you have a 67, some opportunities for improvement have been identified, and because it's more granular, people can act on it more easily. But it's important to keep this simple for your stakeholders. You have to deal with two sides of the coin: you want to give as much guidance as possible to your development team, but at the same time you need to keep it simple for your stakeholders.
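The global-score-plus-breakdown idea above can be sketched in a few lines. To be clear, the dimension names, weights and traffic-light thresholds below are hypothetical illustrations, not Snitch AI's actual scoring scheme.

```python
# Sketch of a global quality metric broken into sub-dimensions.
# Dimensions, weights and thresholds are made up for illustration.
def quality_report(scores: dict[str, float], weights: dict[str, float]) -> dict:
    total = sum(weights.values())
    global_score = sum(scores[d] * weights[d] for d in scores) / total

    def light(score: float) -> str:
        # Traffic-light view for stakeholders: green / yellow / red.
        return "green" if score >= 80 else "yellow" if score >= 60 else "red"

    return {
        "global": round(global_score),
        "breakdown": {d: (scores[d], light(scores[d])) for d in scores},
    }

report = quality_report(
    scores={"performance": 85, "robustness": 55, "drift": 70},
    weights={"performance": 2, "robustness": 1, "drift": 1},
)
print(report)
```

The single `global` number is the stakeholder-facing output; the `breakdown` is the granular view the development team can act on.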
This is a prime example, and this is why I would suggest having a business-level output for your stakeholders and a more concrete set of recommendations for your development team. And please make sure that the quality evaluation process does not become a troubleshooting process.
You need to use this with care: when you have a new final version that you want to validate, you cannot just say, you know what, I'll loop my model through the evaluation until something works. It needs to be a last step, when you're ready to put the model in production. So, to recap: quality evaluation is critical, and right now things are not stable, it's still the far west. You look online and you still see people suggesting that you produce a machine learning model without even using a validation or test data set. That advice is still out there: you can reuse the code and skip the step where you need to validate your model.
And then you think you have a very good solution. It's the far west; you need to structure your practice. This is how you get to a successful deployment, and how you build a succession of good projects under your belt. Again, it's not just one performance metric. I see this all the time: you have a classification problem, you decide to use AUC, the area under the curve, because accuracy isn't necessarily the proper metric.
And you think this is enough to get a clear view of whether your model is of good quality or not. It's a start, but it's not sufficient. You need to think more holistically about the quality of your system: what happens if your data changes, what happens in the long run, what happens if some situations are never looked into, what happens if the end users don't understand the main drivers of your model, and so on.
So this needs to happen. And finally, even if you have the best evaluation process, you need to give insights to your stakeholders, because we're still far from a democratized technology; we still need to get there. And the way you get there is by reassuring your stakeholders and demonstrating that your project is in good hands, and that they can expect a positive return, because you have validated what was needed. It's never a perfect validation.
But you made a good enough evaluation to be confident about the results. And the end users, if you use their language to share this message, will understand that they're in good hands and that they can trust the model you're about to deploy. So this is it for me. I'd like to thank you very much.
And I'm available, I'm all yours for questions.
Perfect. Thank you, Olivier. Very insightful, as always. So we'll open it up for questions. We don't have any right now, so if you want to ask something: at the bottom of your Zoom application you'll see a section called Q&A, simply write your questions there and we'll take them one by one. I did receive a few chats from people asking whether this is going to be recorded. It is recorded, and we'll send it to you afterwards. So you can send in your questions and we'll take them as they come.
We'll just wait for them to come in; there are none right now. In the meanwhile, I'll do a bit of a shameless plug: we have another webinar, as Olivier mentioned, in three weeks. In that one, we're going to go a bit deeper into the validation framework that we have.
So, like Olivier talked about: how to measure discrimination, how to measure reliability by introducing noise into your model, for example, and how to detect drift. We'll go into a bit more detail about all of that, and look into Snitch AI a bit more as well.
Perfect. So we have some questions for you. Guillaume is asking: I'm curious why you say that validation is not troubleshooting. You explained it a bit, but I want to know more about the difference between the two.
It depends on who is impacted by it. Troubleshooting is a necessary activity throughout development, and its objective is a little bit different: troubleshooting is about trying to build the best solution, while quality evaluation is about making sure that the solution you're deploying is of good quality. So if you're waiting until the end to troubleshoot, there's a problem; you need to make sure you have the right tools to help you develop your model along the way.
That way, when you reach the quality evaluation process, you already have a good, valid solution that is ready to deploy. It's a little bit painful, but this type of gate is important, super important: you have one model, you validate it, and then it can go into the deployment phase. It's critical. And it's a little bit the same thing as validation versus test data, which I mentioned earlier. When you want to validate your model along the way, you use the validation data set, and that's great.
You can test on it all the time, and sometimes you randomize it a little to make sure you're not always using the same data. But at the end, you use your test data set. It's exactly the same idea; that's where we need to be.
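The train / validation / test discipline described here can be sketched in a few lines: carve out a held-back test set first, use it only once as the final quality gate, and tune against the validation split. The data below is a toy example.

```python
# Sketch of the train / validation / test split discipline:
# tune on the validation set; touch the test set only once at the end.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # toy data: 100 samples
y = (X.ravel() % 2 == 0).astype(int)

# First carve out the held-back test set used as the final quality gate...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Randomizing the validation split across iterations, as the speaker suggests, amounts to changing `random_state` or using cross-validation, while `X_test` stays untouched until the end.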
Perfect, thanks. So we've got another one from Arjun: when looking for overfitting, what are some tools used by data scientists to check for this? Can you only find out after the deployment?
That's a great question. For overfitting, there are different ways you can check. In fact, what's interesting with overfitting is that, as complex as it is, you can find the smoke: "il n'y a pas de fumée sans feu", there's no smoke without fire.
You can find it by looking at a lot of different things. Using a separate data set that you have not touched before, i.e. the test set, should help you out. But there's something even better, because sometimes you can have duplicates and other issues in your data set: if you use an interpretability technique, you'll be able to see whether the main drivers, the main trends your model has identified, are incorrect.
So if the main features don't make any sense to the subject matter experts, it should ring a bell. Another one, and we'll talk about this later as well: if you add just a little bit of noise to your data set and it causes a big drop in performance, there's usually a problem, and it probably means there's overfitting involved in your model. So those are ways you can get there.
There are also other types of analysis. If you're doing deep learning, look at the history by epoch: usually you'll be able to see it in the learning curves. Anyway, there are a lot of different ways to validate this. And I agree with you, because I suspect this is your point: it's not easy, just by looking at the quality or performance measures we know today, to identify whether there's a little bit of overfitting or not. But when you look beyond those, it's easier to spot.
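Two of the "smoke" signals mentioned above, a large train/test gap and sensitivity to a little noise, can be sketched like this. The model is deliberately overfit for the demonstration, and the noise scale is an arbitrary choice.

```python
# Sketch of two overfitting smoke signals: a large train/test gap,
# and a performance drop when a little noise is added to the inputs.
# The model and noise scale are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so a model that memorizes training data
# cannot generalize: a classic overfitting setup.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training set (deliberately overfit).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

gap = model.score(X_train, y_train) - model.score(X_test, y_test)

rng = np.random.default_rng(0)
noisy_test = X_test + rng.normal(scale=0.1, size=X_test.shape)
noise_drop = model.score(X_test, y_test) - model.score(noisy_test, y_test)

print(f"train/test gap: {gap:.2f}, score drop under noise: {noise_drop:.2f}")
```

A large gap or a large drop does not prove overfitting on its own, but, as the speaker says, it should ring a bell and trigger a closer look.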
Awesome. If you didn't notice our accent, Olivier just made it clear that we're from Quebec by dropping a bit of French there. Perfect, we got another question that just came in, from André: are the main drivers always derived by the model?
It all depends whether this is a philosophical discussion or a discussion about machine learning models. Philosophical in the sense that you might want to identify the main drivers upfront and only use those; but with deep learning, that's becoming less common. The trend in the industry right now is to move away from what we call feature engineering.
Usually, feature engineering means that you find those drivers, and then only the best drivers are used in your model. Now, deep learning allows you to deal with noise more easily, so instead of working on the features, people will take everything and work on the parameters instead, to be optimal.
So, in other words, the trend in the industry is that the model will find the most impactful variables itself. And this is why interpretability techniques are important for validation: these techniques will show you the importance of those variables, and then you can either agree or disagree with the model, and this is where it starts being really interesting.
And to piggyback on the previous question about overfitting: even if you hand-picked the variables based on importance, even if you think those three variables should be the most important, if you overfit, it's possible that they will not show up in the important section, that they will not contribute to the prediction, because there's a problem you need to resolve first.
Great, we've got one last question here from Simon: how can I know my model will perform on new data in production?
That's a good question. First, from a process standpoint, you need to validate that your production data is similar to your training data. It sounds stupid, but this is a major problem. I worked at a fintech company before, and one of my problems was with financial transactions: expenses can be recorded as a negative number, or as a positive number in a separate column. What happened to me, and it was a big problem, was that my training data had been transformed.
In the training data, the numbers were negative if they were expenses and positive if they were revenues; but in real life, in production, the numbers were positive whether they were revenues or expenses, so it didn't work. You need to check this first. Then, once you've validated that and you're okay, you still need to validate that your model is robust enough. The way you do that is by adding a little bit of noise, playing with the data, creating new data sets and test cases, and seeing how the model reacts. If you add a little bit of noise and your model still performs well, then it looks promising.
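A pre-scoring sanity check for a convention mismatch like the fintech example above could look like the sketch below. The column names and the sign rules are hypothetical, invented for illustration; the point is simply to verify that production batches follow the same conventions as the training data before the model ever sees them.

```python
# Sketch of a pre-scoring sanity check for the sign-convention problem:
# verify production data matches the training data's conventions.
# Column names ("type", "amount") and rules are hypothetical.
import pandas as pd

def check_sign_convention(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the batch looks consistent."""
    problems = []
    # Assumed training convention: expenses negative, revenues positive.
    if (df.loc[df["type"] == "expense", "amount"] > 0).any():
        problems.append("expenses should be negative, found positive amounts")
    if (df.loc[df["type"] == "revenue", "amount"] < 0).any():
        problems.append("revenues should be positive, found negative amounts")
    return problems

batch = pd.DataFrame({
    "type": ["expense", "revenue", "expense"],
    "amount": [120.0, 80.0, -45.0],  # the first expense violates the convention
})
print(check_sign_convention(batch))
```

Running a check like this on every incoming batch, before scoring, catches the mismatch at ingestion time instead of as silently wrong predictions.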
It means that even in production, if your data changes a little bit, you're covered. And by all means, it's important for your model to be able to react to new data; the reason you have an AI system in place is precisely to evaluate new data. That's how you get there.
And you have to feel confident about your model prior to deployment, because you don't want to blindly test a new AI system in production.
Perfect. Well, this ends the question period. We want to thank everyone for joining us today. If there are any more questions that come up that you want to ask us, feel free to send them to us; we're more than happy to answer them. But that's it for us today. And again, thanks for joining us.
Thank you very much