Enhancing airport operations through data science readiness solutions

inform blog

THE PILLARS OF DATA SCIENCE READINESS

Nov 12, 2019 Max Uppenkamp

Success stories about the benefits of machine learning and data science prompt companies from all industries to adopt some level of automated data analysis for themselves. In doing so, they often encounter problems early on or, even worse, late in the process. In this article I will give you an overview of the main pillars of data science readiness, as well as how you can achieve them.

Data Availability

No matter the method or goal, the basis for any machine learning endeavor is data. Any machine learning expert (or hobbyist for that matter) will tell you: "The more data the better." While this is generally true, the concept is more complex than it might appear.

Suppose you would like to predict how long an order takes to arrive at its destination. A basis for this project could be a two-dimensional, coherent table of historical order data. By data science convention, each table row represents a so-called instance, which in our example is a single order. The table columns contain all the available information on each order, e.g. value, amount, destination or time to arrival. The concept of "more data" applies to both dimensions: We want a maximum number of past orders to learn from (rows) and a maximum amount of information on each order (columns). Accordingly, the first goal in creating the data basis should be to maximize both dimensions.
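To make the two dimensions concrete, here is a minimal sketch of such an order table in plain Python. The column names and values are invented for illustration; real order data will look different.

```python
# Hypothetical historical order data: one dict per order (row),
# one key per piece of information (column).
orders = [
    {"order_id": 1, "value": 120.0, "amount": 3, "destination": "Aachen", "days_to_arrival": 2},
    {"order_id": 2, "value": 80.5, "amount": 1, "destination": "Berlin", "days_to_arrival": 4},
    {"order_id": 3, "value": 310.0, "amount": 7, "destination": "Aachen", "days_to_arrival": 3},
]


def table_shape(rows):
    """Return (number of instances, number of distinct columns)."""
    columns = set()
    for row in rows:
        columns.update(row)  # adds the keys of this row
    return len(rows), len(columns)


n_rows, n_columns = table_shape(orders)
print(f"{n_rows} orders (rows), {n_columns} attributes (columns)")
```

"More data" then means growing either number: more past orders, or more attributes per order.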

In most real-world scenarios, the useful data is scattered across several database systems, Excel spreadsheets, emails and other semi-adequate storage solutions. In some cases, the data might not even originate from your own infrastructure, for example weather and traffic data. As a result, a significant amount of work goes into consolidating and cross-referencing these data sources to form a coherent representation. This process itself requires a certain skillset, which brings us to our next pillar:
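As a rough illustration of cross-referencing, the sketch below enriches order records with external weather data by matching on city and date. Both data sets, their field names and the function enrich_orders are assumptions made up for this example.

```python
# Hypothetical internal order data.
orders = [
    {"order_id": 1, "destination": "Aachen", "ship_date": "2019-11-01"},
    {"order_id": 2, "destination": "Berlin", "ship_date": "2019-11-01"},
    {"order_id": 3, "destination": "Aachen", "ship_date": "2019-11-02"},
]

# Hypothetical external weather data with different field names.
weather = [
    {"city": "Aachen", "date": "2019-11-01", "condition": "rain"},
    {"city": "Berlin", "date": "2019-11-01", "condition": "clear"},
]


def enrich_orders(orders, weather):
    """Add a 'weather' column to each order; None where no match exists."""
    lookup = {(w["city"], w["date"]): w["condition"] for w in weather}
    enriched = []
    for order in orders:
        key = (order["destination"], order["ship_date"])
        enriched.append({**order, "weather": lookup.get(key)})
    return enriched


for row in enrich_orders(orders, weather):
    print(row)
```

Note that the third order finds no matching weather record. Gaps like this are typical when sources are consolidated and are worth tracking from the start.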

Data Literacy

Data Literacy is the ability to read, comprehend and analyze data. At first glance, "reading" data might sound trivial, but accessing it often means jumping over significant hurdles. Take a piece of machinery that only allows data export at certain intervals, or an outdated piece of software that exports data exclusively in a proprietary format. In these cases, it is necessary to devise individual data extraction strategies even before taking stock of the available data.

Speaking of taking stock: Frequently there is no single person with an exhaustive knowledge of the available data, let alone comprehensive documentation. Instead, this knowledge is divided up between different business divisions or even database admins. Your first step should be to identify the people in charge of the different data silos. In an ideal world you would then put the respective owners into a room and let them figure out the following domain-specific questions:

What data is available?
How is it connected?
Which data is valid?
What is outdated or otherwise irrelevant?

Unfortunately, the real world rarely allows this kind of cooperation due to time constraints or even interpersonal conflicts. It's a good idea to have the data scientist(s) be a part of the process (e.g. in workshops) to reduce the number of questions they will have to ask later on. Once these questions are answered, you should have a good idea of the content and size of your base data.

The steps I just laid out form the basis for a solid data strategy. If implemented properly and updated continuously, this puts you in a good position to initiate data science projects.

Data Quality

At this point, you could hand your data over to your data scientist of choice and let them figure out the rest. Usually data science projects start out with a workshop or two, followed by a proof-of-concept pilot and then a review. Data quality issues tend to arise in this early phase, which is to be expected and prevents them from derailing the project at a later date. Nevertheless, I would like to give you a quick run-through of the different aspects of data quality, and what you can do to get ahead.

Consistency

A consistent data format is a prerequisite for automated data processing in general, even more so in data science applications. Often data is collected and entered by numerous employees, which usually leads to spelling variations, different formatting preferences or other inconsistencies. These deviations can be detected up to a certain point; however, correcting them early on is generally the better approach.
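A minimal sketch of what detecting such deviations can look like: free-text city names are mapped to one canonical spelling, and anything that cannot be mapped is flagged for manual review. The mapping table and the function normalize_city are hypothetical and would need to be built from your own data.

```python
import re

# Hypothetical mapping from known spelling variants to a canonical form.
CANONICAL = {
    "aachen": "Aachen",
    "aix-la-chapelle": "Aachen",
    "berlin": "Berlin",
}


def normalize_city(raw):
    """Return the canonical spelling, or None if the entry is unknown."""
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return CANONICAL.get(key)


entries = ["Aachen", " aachen ", "AACHEN", "Aix-la-Chapelle", "Berlni"]
for entry in entries:
    print(repr(entry), "->", normalize_city(entry))
```

Case and whitespace variants are resolved automatically, while a typo such as "Berlni" is returned as None. This is the limit mentioned above: a lookup can only detect what someone has anticipated.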

Accuracy

The meaning of accuracy in data science is two-fold. On the one hand, accuracy denotes the precision with which data is collected and persisted. On the other hand, accuracy also means that data is entered, processed and stored correctly.

Take, for example, an engineer who logs the parameters of a machine. They might round the temperature value to the nearest decimal place, which should be accurate enough in any case but still introduces inaccuracy. Alternatively, they might measure correctly but make a typo when entering the measurement.

While neither source of inaccuracy can be completely avoided, it's beneficial to at least know where they exist and how often they occur. This information is invaluable for distinguishing actual data outliers from so-called "noise", the expected inaccuracy.
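The following sketch shows one simple way to use that knowledge, assuming the expected noise of a measurement is known (here an invented tolerance of 1.0 degrees). Readings that deviate from the median by more than the tolerance are treated as outlier candidates. The readings and the function name are made up for illustration; real projects often need more robust methods.

```python
from statistics import median


def split_noise_and_outliers(readings, expected_noise):
    """Separate readings within the expected noise band from outlier candidates."""
    center = median(readings)
    within_noise, outliers = [], []
    for value in readings:
        if abs(value - center) <= expected_noise:
            within_noise.append(value)
        else:
            outliers.append(value)
    return within_noise, outliers


# Hypothetical temperature log; 17.3 could be a typo for 71.3.
readings = [71.2, 71.4, 71.3, 71.1, 17.3, 71.5]
within_noise, outliers = split_noise_and_outliers(readings, expected_noise=1.0)
print("within noise:", within_noise)
print("outlier candidates:", outliers)
```

Without a known tolerance, the small fluctuations and the typo would be much harder to tell apart.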

Completeness

Missing data is one of the biggest hindrances in data science projects. In a worst-case scenario, any "row" of the data that has even one entry missing has to be discarded entirely. This can decimate huge datasets to a point where machine learning is no longer possible.
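The effect can be sketched with a tiny, invented table in which None marks a missing entry. Only a quarter of the cells are missing, yet discarding every incomplete row leaves a single usable order.

```python
# Hypothetical order rows; None marks a missing entry.
rows = [
    {"value": 120.0, "destination": "Aachen", "days_to_arrival": 2},
    {"value": None, "destination": "Berlin", "days_to_arrival": 4},
    {"value": 80.0, "destination": None, "days_to_arrival": 3},
    {"value": 95.0, "destination": "Hamburg", "days_to_arrival": None},
]


def complete_rows(rows):
    """Keep only rows in which no entry is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


def missing_per_column(rows):
    """Count missing entries per column to see where the gaps originate."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


print(f"{len(complete_rows(rows))} of {len(rows)} rows are complete")
print(missing_per_column(rows))
```

Counting the gaps per column helps decide whether a column should be fixed at the source or dropped, rather than sacrificing the rows.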

Composition

Even with all prior pitfalls avoided, one remains: dataset composition, which is unfortunately the hardest to detect and fix. I think it's best explained by example:

A big credit bank would like to train machine learning models, so-called regressors, to predict their returns for certain investments. To that end they have meticulously collected and polished a dataset with millions of past investments. By nature of the task they also have a reliable ground truth, meaning the actual performance of each investment. They set a certain amount of their data aside for testing purposes, and train machine learning models on the rest. During testing the models perform well, but when rolled out the predictions are far from reliable. What happened here?

In academic terms, the model is not as generalized as assumed. What this means is that the training data does not accurately represent the real world and therefore the model fails when it encounters "new" situations.
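One crude warning sign can be checked with very little code: how much of the data seen in production lies outside the range the model was trained on. The sketch below does this for a single numeric feature, using invented investment volumes. It is only an assumption-laden illustration, since a test set drawn from the same historical data would pass this check by construction, and real composition problems can hide in combinations of features.

```python
def fraction_outside_training_range(training_values, live_values):
    """Share of live values below the training minimum or above the maximum."""
    low, high = min(training_values), max(training_values)
    outside = [v for v in live_values if v < low or v > high]
    return len(outside) / len(live_values)


# Hypothetical investment volumes seen during training and after roll-out.
training_volumes = [1_000, 5_000, 20_000]
live_volumes = [3_000, 50_000, 250_000, 8_000]

share = fraction_outside_training_range(training_volumes, live_volumes)
print(f"{share:.0%} of live investments lie outside the training range")
```

A high share suggests that the model is being asked about situations it has never seen.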

Dataset composition also plays a role in non-regression tasks, such as self-driving capabilities in cars. To make sensible decisions at all times, the dataset has to sufficiently cover every common and uncommon driving situation. To put it bluntly: Learning based on millions of highway miles will not help your Tesla navigate a 5-way intersection.

Expectations

Lastly, I want to address the way companies approach data science. Now and then we are approached by customers who want to get into data science and even have consolidated data but have no concrete idea what they want to achieve. This idea of "here's our data, see what you can do" can work, but having a clear objective from the start usually results in a much better dataset.

In addition to a clear objective, companies should have realistic expectations when starting out with data science. While machine learning is a great tool for finding patterns and regularities that would otherwise be too intricate to discern, it is not a magic wand.

About our Expert

Max Uppenkamp

Max Uppenkamp has been a Data Scientist at INFORM since 2019. After previously working in Natural Language Processing and Text Mining, he is now engaged in the machine-learning-supported optimization of processes.
In addition to accompanying customer projects, he translates the knowledge gained into practice-oriented products and solutions.
