THE PILLARS OF DATA SCIENCE READINESS
11/12/2019 Max Uppenkamp
Success stories about the benefits of machine learning and data science prompt companies from all industries to adopt some level of automated data analysis for themselves. In doing so, they often encounter problems early on or, even worse, late in the process. In this article I will give you an overview of the main pillars of data science readiness, as well as how you can achieve them.
No matter the method or goal, the basis for any machine learning endeavor is data. Any machine learning expert (or hobbyist for that matter) will tell you: "The more data the better." While this is generally true, the concept is more complex than it might appear.
Suppose you would like to predict how long an order takes to arrive at its destination. A basis for this project could be a two-dimensional, coherent table of historical order data. By data science convention, each table row represents a so-called instance, which in our example is a single order. The table columns contain all the available information on each order, e.g. value, amount, destination, or time to arrival. The concept of "more data" applies to both dimensions: we want a maximum number of past orders to learn from (rows) and a maximum amount of information on each order (columns). Accordingly, the first goal in creating the data basis should be to maximize both dimensions.
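Such a table can be sketched in a few lines of pandas. Everything here is invented for illustration: the column names and values are hypothetical, and a real order table would have far more rows and columns.

```python
import pandas as pd

# A minimal, hypothetical order table: each row is one instance (an order),
# each column one piece of information about it.
orders = pd.DataFrame({
    "order_id":        [1001, 1002, 1003],
    "value_eur":       [250.0, 80.5, 1200.0],
    "amount":          [5, 2, 30],
    "destination":     ["Berlin", "Cologne", "Munich"],
    "days_to_arrival": [3, 2, 7],   # the quantity we want to predict
})

print(orders.shape)  # (3, 5): 3 instances (rows), 5 attributes (columns)
```

"More data" then simply means growing this table in both directions: more rows (orders) and more columns (information per order).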
In most real-world scenarios, the useful data is scattered across several database systems, Excel spreadsheets, Emails and other semi-adequate storage solutions. In some cases, the data might not even originate from your own infrastructure, for example weather and traffic data. As a result, a significant amount of work goes into consolidating and cross-referencing these data sources to form a coherent representation. This process itself requires a certain skillset, which brings us to our next pillar:
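Cross-referencing scattered sources usually boils down to joining tables on shared keys. The following sketch, with entirely hypothetical ERP and weather data, shows one such consolidation step:

```python
import pandas as pd

# Hypothetical: order data from an internal ERP export and weather data
# from an external source, cross-referenced on city and date.
erp_orders = pd.DataFrame({
    "order_id":    [1001, 1002],
    "destination": ["Berlin", "Cologne"],
    "ship_date":   ["2019-10-01", "2019-10-02"],
})
weather = pd.DataFrame({
    "city":    ["Berlin", "Cologne"],
    "date":    ["2019-10-01", "2019-10-02"],
    "rain_mm": [0.0, 12.4],
})

# A left join keeps every order, even when external data is missing for it.
combined = pd.merge(
    erp_orders, weather,
    left_on=["destination", "ship_date"],
    right_on=["city", "date"],
    how="left",
)
print(combined[["order_id", "destination", "rain_mm"]])
```

In practice the hard part is not the join itself but agreeing on the keys: city names, date formats, and IDs rarely match across systems out of the box.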
Data Literacy is the ability to read, comprehend, and analyze data. At first glance, "reading" data might sound trivial, but accessing it often means clearing significant hurdles. Take a piece of machinery that allows data export only at certain intervals, or an outdated piece of software that exports data exclusively in a proprietary format. In these cases, it is necessary to devise individual data extraction strategies even before taking stock of the available data.
Speaking of taking stock: frequently there is no single person with exhaustive knowledge of the available data, let alone comprehensive documentation. Instead, this knowledge is divided up between different business divisions or even database admins. Your first step should be to identify the people in charge of the different data silos. In an ideal world you would then put these stakeholders into a room and let them figure out the following domain-specific questions:
- What data is available?
- How is it connected?
- Which data is valid?
- What is outdated or otherwise irrelevant?
Unfortunately, the real world rarely allows this kind of cooperation due to time constraints or even interpersonal conflicts. It's a good idea to have the data scientist(s) be part of the process (e.g. in workshops) to reduce the number of questions they will have to ask later on. Once these questions are answered, you should have a good idea of the content and size of your base data.
The steps I just laid out form the basis for a solid data strategy. If implemented properly and updated continuously, this puts you in a good position to initiate data science projects.
At this point, you could hand your data over to your data scientist of choice and let them figure out the rest. Usually data science projects start out with a workshop or two, followed by a proof-of-concept pilot and then a review. Data quality issues tend to surface in this early phase, which is to be expected and prevents them from derailing the project at a later date. Nevertheless, I would like to give you a quick run-through of the different aspects of data quality, and what you can do to get ahead.
A consistent data format is a prerequisite for automated data processing in general, even more so in data science applications. Often data is collected and entered by numerous employees, which usually leads to spelling variations, different formatting preferences, or other inconsistencies. These deviations can be detected after the fact up to a point; however, correcting them early on is generally the better approach.
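A typical cleanup pass looks like the following sketch. The example values and the alias table are invented; a real project would build the alias mapping from an inspection of the actual data.

```python
import pandas as pd

# Hypothetical destination column with spelling and formatting variations
# introduced by different employees.
raw = pd.Series(["Cologne", "cologne ", "Köln", "COLOGNE"])

# Trim whitespace, lower-case, then map known aliases onto one canonical
# spelling.
aliases = {"köln": "cologne"}
normalized = (raw.str.strip()
                 .str.lower()
                 .replace(aliases))

print(normalized.nunique())  # 1 -- all four variants collapse to "cologne"
```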
The meaning of accuracy in data science is two-fold. On the one hand, accuracy denotes the precision with which data is collected and persisted. On the other hand, accuracy also means that data is entered, processed and stored correctly.
Take, for example, an engineer who logs the parameters of a machine. They might round the temperature value to the nearest decimal, which should be accurate enough in any case but still introduces inaccuracy. They might also measure correctly but make a typo when entering the measurement.
While neither source of inaccuracy can be completely avoided, it's beneficial to at least know where they exist and how often they occur. This information is invaluable for discerning actual data outliers from so-called "noise", the expected inaccuracy.
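Once the expected noise level is known, flagging values far outside it becomes mechanical. This sketch uses a median-based score on an invented temperature log; the numbers and the threshold are purely illustrative.

```python
import numpy as np

# Hypothetical temperature log: readings are rounded to one decimal, so a
# small, expected noise band exists. One value (12.3) looks like a typo.
temps = np.array([21.3, 21.4, 21.2, 21.3, 21.5, 12.3, 21.4])

median = np.median(temps)
mad = np.median(np.abs(temps - median))        # robust spread estimate
score = np.abs(temps - median) / (mad + 1e-9)  # distance in units of MAD

# Deviations within a few MADs are noise; far beyond, they are outliers.
outliers = temps[score > 10]   # threshold chosen for illustration
print(outliers)                # [12.3]
```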
Missing data is one of the biggest hindrances in data science projects. In a worst-case scenario, any "row" of the data that has even one entry missing has to be dismissed entirely. This can decimate huge datasets to the point where machine learning is no longer possible.
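The effect is easy to demonstrate on a toy table. Both the data and the imputation strategy below are illustrative; whether median imputation is acceptable depends on the project.

```python
import pandas as pd

# Hypothetical dataset where single missing entries are scattered across rows.
df = pd.DataFrame({
    "value":  [250.0, None, 1200.0, 80.5],
    "amount": [5, 2, None, 3],
})

# Worst case: dropping every incomplete row loses half the data here.
print(len(df.dropna()))      # 2

# A common alternative: fill missing numeric values, e.g. with the median.
imputed = df.fillna(df.median())
print(len(imputed))          # 4 -- no rows lost
```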
With all prior pitfalls avoided, dataset composition is unfortunately the hardest one to detect and fix. I think it's best explained by example:
A big credit bank would like to train machine learning models, so-called regressors, to predict their returns for certain investments. To that end, they have meticulously collected and polished a dataset with millions of past investments. By the nature of the task they also have reliable ground truth, meaning the actual performance of each investment. They set a certain amount of their data aside for testing purposes and train machine learning models on the rest. During testing the models perform well, but when rolled out the predictions are far from reliable. What happened here?
In academic terms, the model is not as generalized as assumed. What this means is that the training data does not accurately represent the real world and therefore the model fails when it encounters "new" situations.
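The trap is that the test set is drawn from the same narrow slice as the training set, so it cannot reveal the gap. The following sketch makes this concrete with an invented, deliberately simple relationship between investment size and returns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: returns depend on investment size non-linearly, but the
# collected data only covers small investments (x in [0, 1]).
def true_returns(x):
    return x ** 2

x_train = rng.uniform(0.0, 1.0, 500)
x_test = rng.uniform(0.0, 1.0, 100)   # test split from the SAME narrow slice

# Fit a straight line; on the narrow slice it looks perfectly adequate.
slope, intercept = np.polyfit(x_train, true_returns(x_train), 1)
err_test = np.abs(slope * x_test + intercept - true_returns(x_test)).mean()

# In production, larger investments (x up to 5) appear: the model fails.
x_prod = rng.uniform(0.0, 5.0, 100)
err_prod = np.abs(slope * x_prod + intercept - true_returns(x_prod)).mean()

print(err_test < err_prod)  # True: the test error badly understates reality
```

The cure is not a better model but a better-composed dataset, one that covers the situations the model will actually face.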
Dataset composition also plays a role in non-regression tasks, such as self-driving capabilities in cars. To make sensible decisions at all times, the dataset has to sufficiently cover every common and uncommon driving situation. To put it bluntly: Learning based on millions of highway miles will not help your Tesla navigate a 5-way intersection.
Lastly, I want to address the way companies approach data science. Now and then we are approached by customers who want to get into data science and even have consolidated data but have no concrete idea what they want to achieve. This idea of "here's our data, see what you can do" can work, but having a clear objective from the start usually results in a much better dataset.
In addition to a clear objective, companies should have realistic expectations when starting out with data science. While machine learning is a great tool for finding patterns and regularities that would otherwise be too intricate to discern, it is not a magic wand.
About our Expert
Max Uppenkamp has been a Data Scientist at INFORM since 2019. After previously working in Natural Language Processing and Text Mining, he is now engaged in the machine-learning-supported optimization of processes.
In addition to accompanying customer projects, he translates the knowledge gained into practice-oriented products and solutions.