WHY DATA SCIENCE PROJECTS FAIL
09/16/2019 Björn Heinen
In this article, I will discuss some of the most common reasons for the failure of data science, machine learning, and artificial intelligence projects. While I could not possibly list every reason a project may fail, from my experience these are some of the main reasons why data science projects go wrong.
1. The Data
Let us start with the data quality and quantity. These are the most obvious reasons for a project to go wrong but contrary to common perception, they are not the worst. Almost all companies store data but the type of data and how well they store it differs. When problems have arisen during my previous projects, it was rarely because the necessary data did not exist (workshops are held during project initiation to prevent that from happening). Occasionally there were difficulties because the quality was insufficient (but that is what filter functions and data scientists are for). The actual problems were mostly due to the fact that the data was not available in the expected form.
For example, the only available copies of emails were scanned printouts. Machine data was not read directly from a modern machine and transferred continuously to a database, but instead read off the display and manually entered into Excel spreadsheets (or written on paper). Fraud labels, which were supposed to indicate whether an online purchase was a fraud attempt, only recorded whether the very first invoice had been paid. And so on.
The heart of the problem is that the employees who know exactly what form the data is in, and that this form is not optimal, are usually not the decision-makers who want to drive the machine learning project forward.
2. The Price
For some machine learning projects, the price is not an issue. The data situation is good, the problem clear and manageable, so the solution is quickly developed and integrated. The project costs are manageable and pay for themselves quickly thanks to the added value.
Meanwhile, there are projects that easily create six- or seven-digit added value. They are often more exciting but at the same time also much more complex, time-consuming and therefore more expensive. The issue is not the cost/benefit calculation, which usually looks very favorable. The issue is that there is no mathematical guarantee at the beginning of a data science project that the project will be successful. The question of whether the data is sufficient for a good machine learning model can only be answered with certainty once all the data has been consolidated and a minimum of feature engineering and data cleansing has been carried out. How could you evaluate a complete pipeline without setting it up and executing it at least once?
This proof-of-concept process always offers added value in itself for every company: knowledge about the data situation, data quality and business processes, including deviations from previous assumptions, can be gained, as can recommendations for future action. But many decision-makers don’t even view this path as a possibility because it does not directly and immediately solve the problem at hand. In their view, regardless of the insights gained, there must be certainty from the outset that the goal will be achieved. The corporate culture in this respect - especially in Germany - is still very risk-averse.
3. The Data Scientist
As more and more companies understand the value of their data, data scientists have become a rare commodity. Their salary levels have of course grown accordingly. If we now add the Harvard Business Review, which has declared Data Scientist the “Sexiest Job of the 21st Century”, plus the hype that accompanies the topic anyway, it is no wonder that there are more and more people who slap the term “Data Scientist” on their business card.
The catch is that the supply of “real” data scientists, who have sound knowledge of stochastics, machine learning algorithms, software engineering, evaluation methodology and the like, is growing much more slowly than the number of those who, after a few online courses and some reading on the subject, believe that they have the necessary skills for all applications. Please don't misunderstand me: Data science knowledge can very well be self-taught. It is just much more difficult than is often believed and practiced. The big and unique problem is that it is very easy to practice machine learning incorrectly and at the same time very difficult for outsiders to judge whether or not the work of a data scientist is technically correct.
For example, particular care must be taken to evaluate the results correctly. Training and test data sets must be selected correctly, taking a number of factors into account. Are there concept drifts? Is my sample representative? Am I using features in the same way I would be using them in a live setting? A data scientist who is not fully aware of such issues will not (be able to) properly evaluate his or her model during training and will then be surprised that it performs so differently in production.
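One of these pitfalls, testing a model on data it could not have been tested on in a live setting, can be made concrete with a short sketch. The following Python snippet is only an illustration and not a method described in this article: it assumes that each record is a dictionary with a hypothetical "timestamp" field, and it splits the data chronologically so that the test set contains only records newer than anything in the training set. The function names chronological_split and has_temporal_leakage are made up for this example.

```python
from datetime import date, timedelta


def chronological_split(records, test_fraction=0.2):
    """Split records by time: the newest `test_fraction` becomes the test set.

    Assumes every record is a dict with a comparable "timestamp" value
    (a hypothetical field name chosen for this sketch).
    """
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def has_temporal_leakage(train, test):
    """Return True if any training record is as new as or newer than a test record."""
    if not train or not test:
        return False
    newest_train = max(r["timestamp"] for r in train)
    oldest_test = min(r["timestamp"] for r in test)
    return newest_train >= oldest_test


# Ten daily records, deliberately stored out of chronological order.
start = date(2019, 1, 1)
records = [
    {"timestamp": start + timedelta(days=d), "value": d}
    for d in (3, 0, 9, 1, 7, 2, 8, 5, 4, 6)
]

train, test = chronological_split(records, test_fraction=0.2)
print(len(train), len(test))              # 8 2
print(has_temporal_leakage(train, test))  # False
```

A purely random split of the same records would mix old and new observations, which tends to hide concept drift. Whether a chronological split is the right choice still depends on the problem at hand.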
4. The Acceptance
In many industries, explainability and traceability are so essential that even the best machine learning model will not be used if no one understands how it comes to its conclusions. For some problems, for example in the financial industry, traceability is even required by law. In such cases, data-driven decisions are only made with models that can be directly understood by humans. Discriminative power and precision are usually sacrificed for this, but hey, you can’t have your cake and eat it too.
At this point I would also like to highlight those scenarios where traceability is not really necessary but in which a few false forecasts lead to irreversible mistrust. It usually goes like this: There is an expert user who has been doing the task in question for decades. The new machine learning model yields (on average) considerably more precise values than what has been used before. However, every now and then “the black box” makes a mistake. This mistake, combined with the exclamation “I told you so”, is then used as an excuse to stick with the traditional way of handling the task using the less precise values or estimates. Because with them, at least we know where they came from.
5. The Understanding
Data Science, data mining, machine learning, artificial intelligence, ... These are all terms that have been around for a long time but have only attracted widespread attention in the last few years. This results in expectations and understanding not always matching reality in data science projects. Randall Munroe has taken up part of this problem very well in one of his comics: xkcd.com/1425/.
How much effort is involved in different development steps is often just as misjudged as the actual result of these steps. A short anecdote: We were tasked by a customer to develop an anomaly detection system for a sewer system, i.e. to answer the question: Does anything happen in our sewer system that is outside the norm and to which we should react? After a successful proof of concept, we presented the client with the results of the algorithm, which we were quite sure were relevant anomalies. Contrary to our expectations, the response to the results did not focus on their quality. Instead we were asked “why is it not clickable?”. A simple business intelligence application interactively visualizing the output of the algorithm dispelled all concerns and we could talk about the accuracy of the algorithm and the next steps. Just talking about numbers, data and results does not help you if the users cannot grasp what they are seeing.
6. The Commitment
As with the price, the importance of commitment within the company depends on the complexity of the problem to be solved. Small projects involving only a handful of employees do not require special commitments: the solution is simply developed and put into operation. In larger projects, on the other hand, several departments must be involved, infrastructure created, technical backlogs eliminated, processes changed, and management brought on board. If a large data science project starts from a single department but does not receive support from other departments (regardless of whether they would also benefit from it) and particularly from management, solvable problems become unsolvable problems. One way to make most of these commitment issues disappear is to have a proper data strategy established throughout the company. Another way is to start with the available toolset, develop proofs of concept and use these as an internal marketing vehicle for the next steps.
About our Expert
Björn Heinen has worked at INFORM since 2017 in data science. As Lead Data Scientist, he deals with both internal projects, in which existing INFORM products are extended by Machine Learning functionalities, and external projects, which he manages from development to implementation and integration.